[HN Gopher] Gemini 2.5 Pro Preview
       ___________________________________________________________________
        
       Gemini 2.5 Pro Preview
        
       Author : meetpateltech
       Score  : 434 points
       Date   : 2025-05-06 15:10 UTC (7 hours ago)
        
 (HTM) web link (developers.googleblog.com)
 (TXT) w3m dump (developers.googleblog.com)
        
       | jeswin wrote:
        | Now if only there were a way to add prepaid credits and monitor
        | usage in near real-time on a dashboard, like every other vendor.
        | Hey Google, are you listening?
        
         | greenavocado wrote:
         | You can do that by using deepinfra to manage your billing. It's
         | pay-as-you-go and they have a pass-through virtual target for
         | Google Gemini.
         | 
          | Deepinfra's token usage updates every time you switch to the
          | tab if it's open to the usage page, so it's possible to see
          | updates as often as every second.
        
         | Hawkenfall wrote:
         | You can do this with https://openrouter.ai/
        
           | pzo wrote:
            | but if you want to use the Google SDK (python-genai, js-genai)
            | rather than the OpenAI SDK (I found the Google API more
            | feature-rich when using different modalities like
            | audio/images/video), you cannot use openrouter. Also, not sure
            | about the case where you are developing an app and need higher
            | rate limits - what's the typical rate limit via openrouter?
        
             | pzo wrote:
              | also for some reason when I tested a simple prompt (a few
              | words, no system prompt) with 1 image attached, openrouter
              | charged me ~1700 tokens, while going directly via
              | python-genai it's more like ~400 tokens. Also keep in mind
              | they charge a small markup fee when you top up your account.
        
         | slig wrote:
          | In the meantime, I'm using openrouter.
        
         | cchance wrote:
          | openrouter, I don't think anyone should use Google direct till
          | they fix their shit billing
        
           | greenavocado wrote:
           | Even afterwards. Avoid paying directly if you can because
           | they generally could not care less about individuals.
           | 
            | If you have less than $10 million in spend, you will be
            | treated worse than cattle, because at least farmers feed
            | their cattle before they are milked.
        
         | tucnak wrote:
          | You need LLM Ops. YC happens to have invested in Langfuse;
          | if you're serious about tracking metrics, you'll appreciate
          | the rest of it, too.
         | 
         | And before you ask: yes, for cached content and batch
         | completion discounts you can accommodate both--just needs a bit
         | of logic in your completion-layer code.
        
         | therealmarv wrote:
         | Is this on Google AI Studio or Google Vertex or both?
        
         | simple10 wrote:
         | You can do this with LLM proxies like LiteLLM. e.g. Cursor ->
         | LiteLLM -> LLM provider API.
         | 
         | I have LiteLLM server running locally with Langfuse to view
         | traces. You configure LiteLLM to connect directly to providers'
         | APIs. This has the added benefit of being able to create
         | LiteLLM API keys per project that proxies to different sets of
         | provider API keys to monitor or cap billing usage.
         | 
         | I use https://github.com/LLemonStack/llemonstack/ to spin up
         | local instances of LiteLLM and Langfuse.
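          | 
          | A rough sketch of what the client side can look like once the
          | proxy is up (the port, model name and virtual key below are
          | placeholders; LiteLLM exposes an OpenAI-compatible endpoint):
          | 
          | ```python
          | from openai import OpenAI
          | 
          | # Point the client at the local LiteLLM proxy instead of the
          | # provider. The api_key is a per-project LiteLLM "virtual key"
          | # that maps to the real Gemini key and carries its own budget.
          | client = OpenAI(base_url="http://localhost:4000",
          |                 api_key="sk-litellm-project-a")
          | 
          | resp = client.chat.completions.create(
          |     model="gemini/gemini-2.5-pro-preview-05-06",  # as named in the proxy config
          |     messages=[{"role": "user", "content": "Hello"}],
          | )
          | print(resp.choices[0].message.content)
          | ```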
        
       | ramesh31 wrote:
       | >Best-in-class frontend web development
       | 
       | It really is wild to have seen this happen over the last year.
       | The days of traditional "design-to-code" FE work are completely
       | over. I haven't written a line of HTML/CSS in months. If you are
       | still doing this stuff by hand, you need to adapt fast. In
        | conjunction with an agentic coding IDE and a few MCP tools, weeks'
       | worth of UI work are now done in hours to a _higher_ level of
       | quality and consistency with practically zero effort.
        
         | shostack wrote:
         | What does your tool and model stack look like for this?
        
           | ramesh31 wrote:
           | Cline with Gemini 2.5 (https://cline.bot/)
           | 
           | Framelink MCP (https://github.com/GLips/Figma-Context-MCP)
           | 
           | Playwright MCP (https://github.com/microsoft/playwright-mcp)
           | 
           | Pull down designs via Framelink, optionally enrich with PNG
           | exports of nodes added as image uploads to the prompt, write
           | out the components, test/verify via Playwright MCP.
           | 
           | Gemini has a 1M context size now, so this applies to large
           | mature codebases as well as greenfield. The key thing here is
            | the coding agent being really clever about maintaining its
           | context; you don't need to fit an entire codebase into a
           | single prompt in the same way that you don't need to fit the
           | entire codebase into your head to make a change, you just
           | need enough context on the structure and form to maintain the
           | correct patterns.
        
             | jjani wrote:
              | The designs themselves are still done by humans, I presume?
        
               | ramesh31 wrote:
                | >The designs themselves are still done by humans, I presume?
               | 
               | Indeed, in fact design has become the bottleneck now.
               | Figma has really dropped the ball here WRT building out
                | AI-assisted (not _driven_) tooling for designers.
        
         | redox99 wrote:
         | What tools do you use?
        
         | amarcheschi wrote:
          | I'm surprised by no line of CSS/HTML in months. Maybe it's an
          | exaggeration and that's okay.
          | 
          | However, just today I was building a website for fun with
          | gemini and had to manually fix some issues with CSS that it
          | struggled with. As it often happens, trying to let it repair
          | the damage only made it go into a pit of despair (for me). I
          | fixed the issues at a glance, in about 5 minutes. This is not
          | to say it's bad, but sometimes it still makes absurd mistakes
          | and can't find a way to solve them.
        
           | PaulHoule wrote:
           | I have pretty good luck with AI assistants with CSS and with
           | theming React components like MUI where you have to figure
           | out what to put in an sx or a theme. Sure beats looking
           | through 50 standards documents (fortunately not a lot of
           | "document A invalidates document B" in that pile) or digging
           | through wrong answers where ignoramuses hold court on
           | StackOverflow.
        
           | ramesh31 wrote:
           | >"just today i was building a website for fun with gemini and
           | had to manually fix some issues with css that he struggled
           | with."
           | 
           | Tailwind (with utility classes) is the real key here. It
           | provides a semantic layer over CSS that allows the LLM to
           | reason about how things will actually look. Night and day
           | difference from using stylesheets with custom classes.
        
         | dlojudice wrote:
         | > are now done in hours to a higher level of quality
         | 
         | However, I feel that there is a big difference between the
          | models. In my tests, using Cursor, Claude 3.7 Sonnet has a much
         | more refined "aesthetic sense" than other models. Many times I
         | ask "make it more beautiful" and it manages to improve, where
         | other models just can't understand it.
        
           | danielbln wrote:
           | I've noticed the same, but I wonder if this new Gemini
           | checkpoint is better at it now.
        
         | preommr wrote:
         | Elaborate, because I have serious doubts about this.
         | 
         | If we're talking about just slapping on tailwind+component-
          | library (e.g. shadcn-ui, material), then that's just one step
          | above using no-code solutions. Which, yes, that works well. But
         | if someone didn't need customized logic, then it was always
         | possible to just hop on fiverr or use some very simple
         | template-based tools to accomplish this.
         | 
         | If we're talking more advanced logic, understanding aesthetics,
          | etc., then I'd say it's much worse than other coding areas like
          | backend, because those work on a visual and UX level beyond just
          | code, which is just text manipulation (and what LLMs excel at).
         | In other words, I think the results are still very shallow
         | beyond first impressions.
        
         | kweingar wrote:
         | If it's zero effort, then why do devs need to adapt fast? And
         | wouldn't adapting be incredibly easy?
         | 
         | The only disadvantage to not using these tools would be that
         | your current output is slower. As soon as your employer asks
         | for more or you're looking for a new job, you can just turn on
         | AI and be as fast as everyone who already uses it.
        
           | Workaccount2 wrote:
           | "Why are we paying you $150k/yr to middleman a chatbot?"
        
             | ramesh31 wrote:
             | >"Why are we paying you $150k/yr to middleman a chatbot?"
             | 
             | Because I don't get paid $150k/yr to write HTML and CSS. I
             | get paid to provide technical solutions to business
             | problems. And "chatbots" are a very useful new tool to aid
             | in that.
        
               | kweingar wrote:
               | > I get paid to provide technical solutions to business
               | problems.
               | 
               | That's true of all SWEs who write HTML and CSS, and it's
               | the reason I don't think there's much downside for devs
               | to not proactively start using these agentic tools.
               | 
               | If it truly turns weeks of work into hours as you say,
               | then my managers will start asking me to use them, and I
               | will use them. I won't be at a disadvantage compared to
               | people who started using them a bit earlier than me.
               | 
               | If I am looking for a new job and find an employer that
               | wants people to use agentic tools, then I will tell the
               | hiring manager that I will use those tools. Again, no
               | disadvantage.
               | 
               | Being outdated as a tech employee puts you at a
               | disadvantage to the extent that there is a difficult-to-
               | cross gap. If you are working in COBOL and the market
               | demands Rust engineers, then you need a significant
               | amount of learning/experience to catch up.
               | 
               | But a major pitch of AI tools is that it is not difficult
               | to cross the gap. You draw on your domain experience to
               | describe what you want, and it gives it to you. When it
               | makes a mistake, you draw on your domain experience to
               | tweak or fix things as needed.
               | 
               | Maybe someday there will be a gap. Maybe people will
               | develop years of experience and intuition using
               | particular AI tools that makes them much more attractive
               | than somebody without this experience. But the tools are
               | churning so quickly (Claude Code and Cursor are brand
               | new, tools from 18 months ago are obsolete, newer and
               | better tools are surely coming soon) that this seems far
               | off.
        
           | jaccola wrote:
           | Yup, I see comments like the parent all of the time and they
           | are always a head scratcher. They would be far more rational
           | (and a bit desperate) if they were trying to sell something,
           | but they never appear to be.
           | 
           | Always "10x"/"100x" more productive with AI, "you will miss
           | out if you don't adopt now"! Build a great company 100x
           | faster and every rational actor in the market will notice,
           | believe you and be begging to adopt your ways of working (and
           | you will become filthy rich as a nice kicker).
           | 
           | The proof of the pudding is in the eating.
        
         | mediaman wrote:
         | I find they achieve acceptable, but not polished levels of
         | work.
         | 
         | I'm not even a designer, but I care about the consistency of UI
         | design and whether the overall experience is well-organized,
         | aligned properly, things are placed in a logical flow for the
         | user, and so on.
         | 
         | While I'm pro-AI tooling and use it heavily, and these models
         | usually provide a good starting point, I can't imagine shipping
         | the slop without writing/editing a line of HTML for anything
         | that's interaction-heavy.
        
       | siwakotisaurav wrote:
        | Usually I don't believe the benchmarks, but being first in the
        | web dev arena specifically is crazy. That one has been Claude for
        | so long, which tracks with my experience.
        
         | hersko wrote:
         | Give Gemini a shot. It is genuinely very good.
        
         | enraged_camel wrote:
         | I'm wondering when Claude 4 will drop. It's long overdue.
        
           | danielbln wrote:
           | I was a little disappointed when the last thing coming out of
           | Anthropic was their MAX pricing plan instead of a better
           | model...
        
           | Etheryte wrote:
           | For me, Claude 3.7 was a noticeable step down across a wide
           | range of tasks when compared to 3.5 with the same prompt.
           | Benchmarks are one thing, but for real life use, I kept
           | finding myself switching back to 3.5. Wouldn't be surprised
           | if they were trying to figure out what happened there and how
           | to prevent that in the next version.
        
       | ranyume wrote:
       | I don't know if I'm doing something wrong, but every time I ask
       | gemini 2.5 for code it outputs SO MANY comments. An exaggerated
        | amount of comments. Section comments, step comments, block
       | comments, inline comments, all the gang.
        
         | cchance wrote:
          | And comments are bad? I mean you could tell it not to comment
          | the code or to self-document with naming instead of inline
          | comments; it's an LLM, it does what you tell it to.
        
         | tucnak wrote:
         | Ask it to do less of it, problem solved, no? With tools like
         | Cursor it's become really easy to fit the models to the shoe,
         | or the shoe to the foot.
        
         | GaggiX wrote:
          | You can ask it not to use comments or to use fewer comments;
          | you can put this in the system prompt too.
        
           | ChadMoran wrote:
           | I've tried this, aggressively and it still does it for me. I
           | gave up.
        
             | ziml77 wrote:
             | I tried this as well. I'm interfacing with Gemini 2.5 using
              | Cursor and I have rules to limit the comments. It still
             | ends up over-commenting.
        
               | shawabawa3 wrote:
                | I have a feeling this may be a Cursor issue; perhaps
                | Cursor's system prompt asks for comments? Asking in the
               | aistudio UI for code and ending the prompt with "no code
               | comments" has always worked for me
        
             | koakuma-chan wrote:
             | Have you tried threats?
        
               | throwup238 wrote:
               | It strips the comments from the code or else it gets the
               | hose again.
        
           | blensor wrote:
           | Maybe too many comments could be a good metric to check if
           | someone just yolo accepted the result or if they actually
           | checked if it's correct.
           | 
            | I don't have problems with getting lots of comments in the
            | output, I am just deleting them while reading what it did.
        
             | tough wrote:
              | another great tell of code reviewers yolo'ing it is that
              | LLMs usually put the full filename path in the output, so
              | if you see a file with the filename/path on the first
              | line, that's probably LLM output.
        
         | Scene_Cast2 wrote:
         | It also does super defensive coding. Not that it's a bad thing
         | in general, but I write a lot of prototype code.
        
           | prpl wrote:
           | Production quality code is defensive. Probably trained on a
           | lot of google code.
        
             | montebicyclelo wrote:
              | Does the code consist of many large try/except blocks that
              | catch the generic "Exception"? Gemini seems to like doing
              | that (I thought it was bad practice to catch the generic
              | Exception in Python).
        
               | hnuser123456 wrote:
               | Catching the generic exception is a nice middleground
               | between not catching exceptions at all (and letting your
               | script crash), and catching every conceivable exception
               | individually and deciding exactly how to handle each one.
               | Depends on how reliable you need your code to be.
        
               | montebicyclelo wrote:
               | Hmm, for my use case just allowing the lines to fail
                | would have been better (which I told the model).
        
             | Tainnor wrote:
             | Depends on what you mean by "defensive". Anticipating error
             | and non-happy-path cases and handling them is definitely
             | good. Also fault tolerance, i.e. allowing parts of the
             | application to fail without bringing down everything.
             | 
             | But I've heard "defensive code" used for the kind of code
             | where almost every method validates its input parameters,
             | wraps everything in a try-catch, returns nonsensical
             | default values in failure scenarios, etc. This is a
             | complete waste because the caller won't know what to do
             | with the failed validations or thrown errors, and it's just
             | unnecessary bloat that obfuscates the business logic.
             | Validation, error handling and so on should be done in
             | specific parts of the codebase (bonus points if you can
             | encode the successful validation or the presence/absence of
             | errors in the type system).
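              | 
              | A small Python sketch of that distinction (illustrative
              | names only, nothing from a real codebase):
              | 
              | ```python
              | from dataclasses import dataclass
              | 
              | class InvalidOrder(Exception):
              |     """Raised at the boundary when a payload can't be parsed."""
              | 
              | @dataclass
              | class Order:
              |     price: float
              |     qty: int
              | 
              | # The "defensive everywhere" style: swallow the error and
              | # return a nonsensical default the caller can't act on.
              | def parse_price(raw) -> float:
              |     try:
              |         return float(raw)
              |     except Exception:
              |         return 0.0
              | 
              | # Validation concentrated at the boundary, with narrow
              | # exception types and a failure the caller can handle.
              | def load_order(payload: dict) -> Order:
              |     try:
              |         return Order(price=float(payload["price"]),
              |                      qty=int(payload["qty"]))
              |     except (KeyError, ValueError) as e:
              |         raise InvalidOrder(f"bad order payload: {e}") from e
              | ```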
        
               | neilellis wrote:
               | this!
               | 
                | lots of hasattr("") rubbish; I've increased the amount of
                | prompting but it still does this - basically it defers
                | its lack of compile-time knowledge to runtime: 'let's
                | hope for the best, and see what happens!'
               | 
               | Trying to teach it FAIL FAST is an uphill struggle.
               | 
               | Oh and yes, returning mock objects if something goes
               | wrong is a favourite.
               | 
               | It truly is an Idiot Savant - but still amazingly
               | productive.
        
         | Benjammer wrote:
         | I've found that heavily commented code can be better for the
         | LLM to read later, so it pulls in explanatory comments into
         | context at the same time as reading code, similar to pulling in
         | @docs, so maybe it's doing that on purpose?
        
           | koakuma-chan wrote:
           | No, it's just bad. I've been writing a lot of Python code
           | past two days with Gemini 2.5 Pro Preview, and all of its
           | code was like:
           | 
            | ```python
            | def whatever():
            |     # --- SECTION ONE OF THE CODE ---
            |     ...
            |     # --- SECTION TWO OF THE CODE ---
            |     try:
            |         [some "dangerous" code]
            |     except Exception as e:
            |         logging.error(f"Failed to save files to {output_path}: {e}")
            |         # Decide whether to raise the error or just warn
            |         # raise IOError(f"Failed to save files to {output_path}: {e}")
            | ```
           | 
           | (it adds commented out code like that all the time, "just in
           | case")
           | 
           | It's terrible.
           | 
           | I'm back to Claude Code.
        
             | brandall10 wrote:
             | It's certainly annoying, but you can try following up with
             | "can you please remove superfluous comments? In particular,
             | if a comment doesn't add anything to the understanding of
             | the code, it doesn't deserve to be there".
        
               | diggan wrote:
               | I'm having the same issue, and no matter what I prompt
               | (even stuff like "Don't add any comments at all to
               | anything, at any time") it still tries to add these
               | typical junior-dev comments where it's just re-iterating
               | what the code on the next line does.
        
               | tough wrote:
               | you can have a script that drops them all
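                | 
                | e.g. for Python source, a minimal sketch using the stdlib
                | tokenize module (it leaves blank space where the comments
                | were, so you'd probably run a formatter afterwards):
                | 
                | ```python
                | import io
                | import sys
                | import tokenize
                | 
                | def strip_comments(source: str) -> str:
                |     """Return the source with all # comments dropped."""
                |     toks = tokenize.generate_tokens(io.StringIO(source).readline)
                |     kept = [t for t in toks if t.type != tokenize.COMMENT]
                |     return tokenize.untokenize(kept)
                | 
                | if __name__ == "__main__":
                |     sys.stdout.write(strip_comments(sys.stdin.read()))
                | ```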
        
               | shawabawa3 wrote:
               | You don't need a follow up
               | 
               | Just end your prompt with "no code comments"
        
               | brandall10 wrote:
               | I prefer not to do that as comments are helpful to guide
               | the LLM, and esp. show past decisions so it doesn't redo
               | things, at least in the scope of a feature. For me this
               | tends to be more of a final refactoring step to tidy them
               | up.
        
             | NeutralForest wrote:
              | I'm seeing it blindly catching exceptions in Python all the
              | time. I see it in my colleagues' code all the time; it's
              | driving me nuts.
        
               | jerkstate wrote:
               | There are a bunch of stupid behaviors of LLM coding that
               | will be fixed by more awareness pretty soon. Imagine
               | putting the docs and code for all of your libraries into
               | the context window so it can understand what exceptions
               | might be thrown!
        
               | maccard wrote:
               | Copilot and the likes have been around for 4 years, and
               | we've been hearing this all along. I'm bullish on LLM
               | assistants (not vibe coding) but I'd love to see some of
               | these things actually start to happen.
        
               | kenjackson wrote:
               | I feel like it has gotten better over time, but I don't
                | have any metrics to confirm this. And it may also depend
                | on what type of language/libraries you use.
        
               | maccard wrote:
               | I feel like there was a huge jump when cursor et al
               | appeared, and things have been "changing" since then
               | rather than improving.
        
               | NeutralForest wrote:
               | It just feels to me like trying to derive correct
               | behavior without a proper spec so I don't see how it'll
               | get that much better. Maybe we'll collectively remove the
               | pathological code but otherwise I'm not seeing it.
        
               | tclancy wrote:
               | Well, at least now we know who to blame for the training
               | data :)
        
               | JoshuaDavid wrote:
               | The training loop asked the model to one-shot working
               | code for the given problems without being able to
               | iterate. If you had to write code that _had_ to work on
               | the first try, and where a partially correct answer was
               | better than complete failure, I bet your code would look
               | like that too.
               | 
               | In any case, it knows what good code looks like. You can
               | say "take this code and remove spurious comments and
               | prefer narrow exception handling over catch-all", and
                | it'll do just fine (in a way it _wouldn't_ do just fine
               | if your prompt told it to write it that way the first
               | time, writing new code and editing existing code are
               | different tasks).
        
               | NeutralForest wrote:
                | It's only an example; there's plenty of irrelevant stuff
                | that LLMs default to which is pretty bad Python. I'm not
               | saying it's always bad but there's a ton of not so nice
               | code or subtly wrong code generated (for example file and
               | path manipulation).
        
           | breppp wrote:
           | I always thought these were there to ground the LLM on the
           | task and produce better code, an artifact of the fact that
           | this will autocomplete better based on past tokens. Similarly
           | always thought this is why ChatGPT always starts every reply
           | with repeating exactly what you asked again
        
             | rst wrote:
             | Comments describing the organization and intent, perhaps.
             | Comments just saying what a "require ..." line requires,
             | not so much. (I find it will frequently put notes on the
             | change it is making in comments, contrasting it with the
             | previous state of the code; these aren't helpful at all to
             | anyone doing further work on the result, and I wound up
             | trimming a lot of them off by hand.)
        
         | guestbest wrote:
         | What kind of problems are you putting in where that is the
         | solution? Just curious.
        
         | dyauspitr wrote:
         | Just ask it for fewer comments, it's not rocket science.
        
         | Maxatar wrote:
         | Tell it not to write so many comments then. You have a great
         | deal of flexibility in dictating the coding style and can even
         | include that style in your system prompt or upload a coding
         | style document and have Gemini use it.
        
           | Trasmatta wrote:
           | Every time I ask an LLM to not write comments, it still
           | litters it with comments. Is Gemini better about that?
        
             | dheera wrote:
             | I usually ask ChatGPT to "comment the shit out of this" for
             | everything it writes. I find it vastly helps future LLM
             | conversations pick up all of the context and why various
             | pieces of code are there.
             | 
             | If it is ingesting data, there should also be a sample of
             | the data in a comment.
        
             | sitkack wrote:
             | LLMs are extremely poor at following negative instructions,
             | tell them what to do, not what not to do.
        
               | diggan wrote:
               | Ok, so saying "Implement feature X" leads to a ton of
                | comments. How do you rewrite that prompt to not include
                | "don't write comments" while making the output not
                | contain comments? "Write only source code, no plain
               | text with special characters in the beginning of the
               | line" or what are you suggesting here in practical terms?
        
               | FireBeyond wrote:
               | "Implement feature X, and as you do, insert only minimal
               | and absolutely necessary comments that explain why
               | something is being done, not what is being done."
        
               | sitkack wrote:
               | You would say "omit the how". That word has negation
               | built in.
        
               | sroussey wrote:
               | "Constrain all comments to a single block at the top of
               | the file. Be concise."
               | 
               | Or something similar that does not rely on negation.
        
               | sitkack wrote:
               | I also include something about "Target the comments
               | towards a staff engineer that favors concise comments
               | that focus on the why, and only for code that might cause
               | confusion."
               | 
               | I also try and get it to channel that energy into the doc
               | strings, so it isn't buried in the source.
        
               | diggan wrote:
               | But I want no comments whatsoever, not one huge block of
               | comments at the top of the file. How'd I get that without
               | negation?
               | 
               | Besides, other models seems to handle negation correctly,
               | not sure why it's so difficult for the Gemini family of
               | models to understand.
        
               | staticman2 wrote:
               | This is sort of LLM specific. For some tasks you might
               | try including the word comment but give the order at the
               | beginning and end of the prompt. This is very model
               | dependent. Like:
               | 
                | Refactor this. Do not write any comments.
                | 
                | <code to refactor>
                | 
                | As a reminder, your task is to refactor the above code
                | and not write any comments.
        
               | diggan wrote:
               | > Do not write any comments. [...] do not write any
               | comments.
               | 
               | Literally both of those are negations.
        
               | staticman2 wrote:
               | Yes my suggestion is that negations can work just fine,
               | depending on the model and task, and instead of avoiding
                | negations you can try other prompting strategies like
               | emphasizing what you want at the beginning and at the end
               | of the prompt.
               | 
               | If you think negations never work tell Gemini 2.5 to
               | "write 10 sentences that do not include the word the" and
               | see what happens.
        
               | Hackbraten wrote:
               | "Whenever you are tempted to write a line or block
               | comment, it is imperative that you just write the actual
               | code instead"
        
             | grw_ wrote:
             | No, you can tell it not to write these comments in every
             | prompt and it'll still do it
        
             | nearbuy wrote:
             | Sample size of one, but I just tried it and it worked for
             | me on 2.5 pro. I just ended my prompt with "Do not include
             | any comments whatsoever."
        
         | taf2 wrote:
         | I really liked the Gemini 2.5 pro model when it was first
         | released - the upload code folder was very nice (but they
          | removed it). The annoying thing I find with the model is that it
          | does a really bad job of formatting the code it generates... I
          | know I can use a code formatting tool, and I do when I use
          | gemini output, but otherwise I find grok much easier to work
          | with and it yields better results.
        
           | throwup238 wrote:
           | _> I really liked the Gemini 2.5 pro model when it was first
           | released - the upload code folder was very nice (but they
           | removed it)._
           | 
           | Removed from where? I use the attach code folder feature
           | every day from the Gemini web app (with a script that clones
           | a local repo that deletes .git and anything matching a
           | gitignore pattern).
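            | 
            | A sketch of one way to do it (using git ls-files to copy only
            | tracked files, so .git and anything gitignored never come
            | along; the details differ from the actual script):
            | 
            | ```python
            | import shutil, subprocess, tempfile
            | from pathlib import Path
            | 
            | def export_clean_copy(repo: str) -> Path:
            |     """Copy only git-tracked files into a fresh temp dir."""
            |     dest = Path(tempfile.mkdtemp(prefix="repo-upload-"))
            |     files = subprocess.run(
            |         ["git", "-C", repo, "ls-files"],
            |         capture_output=True, text=True, check=True,
            |     ).stdout.splitlines()
            |     for rel in files:
            |         target = dest / rel
            |         target.parent.mkdir(parents=True, exist_ok=True)
            |         shutil.copy2(Path(repo) / rel, target)
            |     return dest
            | 
            | print(export_clean_copy("."))
            | ```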
        
         | puika wrote:
         | I have the same issue plus unnecessary refactorings (that break
          | functionality). It doesn't matter if I write a whole paragraph
         | in the chat or the prompt explaining I don't want it to change
         | anything else apart from what is required to fulfill my very
         | specific request. It will just go rogue and massacre the
         | entirety of the file.
        
           | fkyoureadthedoc wrote:
           | Where/how do you use it? I've only tried this model through
           | GitHub Copilot in VS Code and I haven't experienced much
           | changing of random things.
        
             | diggan wrote:
             | I've used it via Google's own AI studio and via my own
             | library/program using the API and finally via Aider. All of
             | them lead to the same outcome, large chunks of changes to a
             | lot of unrelated things ("helpful" refactors that I didn't
             | ask for) and tons of unnecessary comments everywhere (like
             | those comments you ask junior devs to stop making). No
              | amount of prompting seems to address either problem.
        
           | dherikb wrote:
            | I have exactly the same issue using it with Aider.
        
           | mgw wrote:
           | This has also been my biggest gripe with Gemini 2.5 Pro.
           | While it is fantastic at one-shotting major new features,
           | when wanting to make smaller iterative changes, it always
           | does big refactors at the same time. I haven't found a way to
           | change that behavior through changes in my prompts.
           | 
           | Claude 3.7 Sonnet is much more restrained and does smaller
           | changes.
        
             | cryptoz wrote:
             | This exact problem is something I'm hoping to fix with a
             | tool that parses the source to AST and then has the LLM
             | write code to modify the AST (which you then run to get
             | your changes) rather than output code directly.
             | 
             | I've started in a narrow niche of python/flask webapps and
             | constrained to that stack for now, but if you're interested
             | I've just opened it for signups:
             | https://codeplusequalsai.com
             | 
             | Would love feedback! Especially if you see promising
             | results in not getting huge refactors out of small change
             | requests!
             | 
             | (Edit: I also blogged about how the AST idea works in case
             | you're just that curious: https://codeplusequalsai.com/stat
             | ic/blog/prompting_llms_to_m...)
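              | 
              | A toy sketch of the general idea (not the actual
              | implementation): the model emits a transformation over the
              | parsed tree instead of rewriting the whole file. Plain
              | ast.unparse drops comments/formatting, which is why tools
              | often reach for libcst instead.
              | 
              | ```python
              | import ast
              | 
              | class RenameFunction(ast.NodeTransformer):
              |     # The kind of narrow edit an LLM could emit: rename one
              |     # function def and touch nothing else (names made up).
              |     def visit_FunctionDef(self, node: ast.FunctionDef):
              |         self.generic_visit(node)
              |         if node.name == "old_handler":
              |             node.name = "new_handler"
              |         return node
              | 
              | source = open("app.py").read()
              | tree = RenameFunction().visit(ast.parse(source))
              | print(ast.unparse(tree))  # the rest of the file is untouched
              | ```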
        
               | jtwaleson wrote:
               | Having the LLM modify the AST seems like a great idea.
               | Constraining an LLM to only generate valid code would be
               | super interesting too. Hope this works out!
        
               | HenriNext wrote:
                | Interesting idea. But LLMs are trained on vast amounts of
                | "code as text" and a tiny fraction of "code as AST";
                | wouldn't that significantly hurt the result quality?
        
               | cryptoz wrote:
               | Thanks and yeah that is a concern; however I have been
               | getting quite good results from this AST approach, at
               | least for building medium-complexity webapps. On the
               | other hand though, this wasn't always true...the only
               | OpenAI model that really works well is o3 series. Older
               | models do write AST code but fail to do a _good job_
               | because of the exact issue you mention, I suspect!
        
               | tough wrote:
                | Interesting, I started playing with ts-morph and neo4j to
               | parse TypeScript codebases.
               | 
               | simonw has symbex which could be useful for you for
               | python
        
             | nolist_policy wrote:
             | Can't you just commit the relevant parts? The git index is
             | made for this sort of thing.
        
               | tasuki wrote:
               | It's not always trivial to find the relevant 5 line
               | change in a diff of 200 lines...
        
             | fwip wrote:
             | Really? I haven't tried Gemini 2.5 yet, but my main
             | complaint with Claude 3.7 is this exact behavior - creating
             | 200+ line diffs when I asked it to fix one function.
        
           | bugglebeetle wrote:
           | This is generally controllable with prompting. I usually
           | include something like, "be excessively cautious and
           | conservative in refactoring, only implementing the desired
           | changes" to avoid.
        
         | energy123 wrote:
         | It probably increases scores in the RL training since it's a
         | kind of locally specific reasoning that would reduce bugs.
         | 
         | Which means if you try to force it to stop, the code quality
         | will drop.
        
         | bugglebeetle wrote:
         | It's annoying, but I've done extensive work with this model and
         | leaving the comments in for the first few iterations produced
         | better outcomes. I expect this is baked into the RL they're
         | doing, but because of the context size, it's not really an
         | issue. You can just ask it to strip out in the final pass.
        
         | lukeschlather wrote:
         | I usually remove the comments by hand. It's actually pretty
         | helpful, it ensures I've reviewed every piece of code
         | carefully, especially since most of the comments are literally
         | just restating the next line, and "does this comment add any
         | information?" is a really helpful question to make sure I
         | understand the code.
        
           | tasuki wrote:
            | Same! It eases my code review. On the rare occasions I don't
           | want to do that, I ask the LLM to provide the code without
           | comments.
        
         | mrinterweb wrote:
         | If you don't want so many comments, have you tried asking the
          | AI for fewer comments? Seems like something a little prompt
         | engineering could solve.
        
         | asadm wrote:
         | you need to do a 2nd step as a post-process to erase the
         | comments.
         | 
         | Models use comments to think, asking to remove will affect code
         | quality.
        
         | merksittich wrote:
         | My favourites are comments such as: from openai import
         | DefaultHttpxClient # Import the httpx client
        
         | benbristow wrote:
         | You can ask it to remove the comments afterwards, and it'll do
         | a decent job of it, but yeah, it's a pain.
        
         | sureIy wrote:
         | My custom default Claude prompt asks it to never explain code
         | unless specifically asked to. Also to produce modern and
         | compact code. It's a beauty to see. You ask for code and you
         | get code, nothing else.
        
         | Workaccount2 wrote:
         | I have a strong sense that the comments are for the model more
         | than the user. It's effectively more thinking in context.
        
         | HenriNext wrote:
         | Same experience. Especially the "step" comments about the
         | performed changes are super annoying. Here is my prompt-rule to
         | prevent them:
         | 
         | "5. You must never output any comments about the progress or
         | type of changes of your refactoring or generation. Example: you
         | must NOT add comments like: 'Added dependency' or 'Changed to
         | new style' or worst of all 'Keeping existing implementation'."
        
         | kurtis_reed wrote:
         | > all the gang
         | 
         | What does that mean?
        
         | Semaphor wrote:
         | 2.5 was the most impressive model I use, but I agree about the
         | comments. And when refactoring some code it wrote before, it
         | just adds more comments, it becomes like archaeological history
         | (disclaimer: I don't use it for work, but to see what it can
         | do, so I try to intervene as little as possible, and get it to
         | refactor what it _thinks_ it should)
        
         | Hikikomori wrote:
         | So many comments, more verbose code and will refactor stuff on
         | its own. Still better than chatgpt, but I just want a small
         | amount of code that does what I asked for so I can read through
         | it quickly.
        
         | freddydumont wrote:
         | That's been my experience as well. It's especially jarring when
         | asking for a refactor as it will leave a bunch of WIP-style
         | comments highlighting the difference with the previous
         | approach.
        
         | AuthConnectFail wrote:
          | you can ask it to remove them; it does a pretty good job at it
        
       | segphault wrote:
       | My frustration with using these models for programming in the
       | past has largely been around their tendency to hallucinate APIs
       | that simply don't exist. The Gemini 2.5 models, both pro and
       | flash, seem significantly less susceptible to this than any other
       | model I've tried.
       | 
       | There are still significant limitations, no amount of prompting
       | will get current models to approach abstraction and architecture
       | the way a person does. But I'm finding that these Gemini models
       | are finally able to replace searches and stackoverflow for a lot
       | of my day-to-day programming.
        
         | thefourthchime wrote:
         | Ask the models that can search to double check their API usage.
         | This can just be part of a pre-prompt.
        
         | ksec wrote:
        | I have been asking whether AI without hallucination, coding or
        | not, is possible, but so far with no real concrete answer.
        
           | mattlondon wrote:
           | It's already much improved on the early days.
           | 
            | But I wonder when we'll be happy? Do we expect colleagues,
            | friends, and family to be 100% laser-accurate 100% of the
           | time? I'd wager we don't. Should we expect that from an
           | artificial intelligence too?
        
             | cinntaile wrote:
              | It's a tool, not a human, so I don't know if the comparison
              | even makes sense?
        
             | ziml77 wrote:
             | Yes we should expect better from an AI that has a knowledge
             | base much larger than any individual and which can very
             | quickly find and consume documentation. I also expect them
             | to not get stuck trying the same thing they've already been
             | told doesn't work, same as I would expect from a person.
        
             | kweingar wrote:
             | I expect my calculator to be 100% accurate 100% of the
             | time. I have slightly more tolerance for other software
             | having defects, but not much more.
        
               | asadotzler wrote:
               | And a $2.99 drugstore slim wallet calculator with solar
               | power gets it right 100% of the time while billion dollar
               | LLMs can still get arithmetic wrong on occasion.
        
               | pb7 wrote:
               | My hammer can't do any arithmetic at all, why does anyone
               | even use them?
        
               | namaria wrote:
               | Does it sometimes instead of driving a nail hit random
               | things in the house?
        
               | hn_go_brrrrr wrote:
               | Yes, like my thumb.
        
               | izacus wrote:
               | What you're being asked is to stop trying to hammer every
               | single thing that comes into your vicinity. Smashing your
               | computer with a hammer won't create code.
        
               | Analemma_ wrote:
               | I don't think that's the relevant comparison though. Do
               | you expect StackOverflow or product documentation to be
               | 100% accurate 100% of the time? I definitely don't.
        
               | ctxc wrote:
                | The error introduced by the data is expected and
                | internalized; it's the error of LLMs on _top_ of that
                | that's hard to internalize.
        
               | ctxc wrote:
                | Also, documentation and SO are incorrect in a predictable
                | way. We don't expect them to state, in a matter-of-fact
                | way, things that just don't exist.
        
               | kweingar wrote:
               | I actually agree with this. I use LLMs often, and I don't
               | compare them to a calculator.
               | 
               | Mainly I meant to push back against the reflexive
               | comparison to a friend or family member or colleague. AI
               | is a multi-purpose tool that is used for many different
               | kinds of tasks. Some of these tasks are analogues to
               | human tasks, where we should anticipate human error.
               | Others are not, and yet we often ask an LLM to do them
               | anyway.
        
               | pizza wrote:
               | Are you sure about that? Try these..
               | 
               | - (1e(1e10) + 1) - 1e(1e10)
               | 
               | - sqrt(sqrt(2)) * sqrt(sqrt(2)) * sqrt(sqrt(2)) *
               | sqrt(sqrt(2))
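                | 
                | Rough illustration of why these trip up IEEE-754 doubles
                | (the "+ 1" falls below the float spacing at that magnitude,
                | repeated roots pick up rounding error, and 1e(1e10) itself
                | just overflows):
                | 
                | ```python
                | import math
                | 
                | print(0.1 + 0.2)          # 0.30000000000000004
                | print((1e17 + 1) - 1e17)  # 0.0, not 1.0
                | x = math.sqrt(math.sqrt(2.0))
                | print(x * x * x * x)      # close to, but usually not exactly, 2.0
                | ```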
        
               | ctxc wrote:
               | Three decades and I haven't had to do anything remotely
               | resembling this on a calculator, much less find the
               | calculator wrong. Same for the majority of general
               | population I assume.
        
               | jjmarr wrote:
               | (1/3)*3
        
               | tasuki wrote:
               | The person you're replying to pointed out that you
               | shouldn't expect a calculator to be 100% accurate 100% of
               | the time. _Especially_ not when faced with adversarial
               | prompts.
        
               | Vvector wrote:
               | Try "1/3". The calculator answer is not "100% accurate"
        
               | bb88 wrote:
               | I had a casio calculator back in the 1980's that did
               | fractions.
               | 
               | So when I punched in 1/3 it was exactly 1/3.
        
               | gilbetron wrote:
               | It's your option not to use it. However, this is a
               | competitive environment and so we will see who pulls out
               | ahead, those that use AI as a productivity multiplier
               | versus those that do not. Maybe that multiplier is less
               | than 1, time will tell.
        
               | kweingar wrote:
               | Agreed. The nice thing is that I am told by HN and
               | Twitter that agentic workflows makes code tasks very
               | easy, so if it turns out that using these tools
               | multiplies productivity, then I can just start using them
               | and it will be easy. Then I am caught up with the early
               | adopters and don't need to worry about being out-competed
               | by them.
        
               | mattlondon wrote:
               | AIs aren't intended to be used as calculators though?
               | 
               | You could say that when I use my spanner/wrench to
               | tighten a nut it works 100% of the time, but as soon as I
               | try to use a screwdriver it's terrible and full of
                | problems and it can't even reliably do something as
               | trivially easy as tighten a nut, even though a
               | screwdriver works the same way by using torque to tighten
               | a fastener.
               | 
               | Well that's because one tool is designed for one thing,
               | and one is designed for another.
        
               | mdp2021 wrote:
               | > _AIs are_
               | 
               | "AI"s are designed to be reliable; "AGI"s are designed to
               | be intelligent; "LLM"s seem to be designed to make some
               | qualities emerge.
               | 
               | > _one tool is designed for one thing, and one is
               | designed for another_
               | 
               | The design of LLMs seems to be "let us see where the
               | promise leads us". That is not really "design", i.e.
               | "from need to solution".
        
               | thewebguyd wrote:
               | > AIs aren't intended to be used as calculators though?
               | 
               | Then why are we using them to write code, which should
               | produce reliable outputs for a given input...much like a
               | calculator.
               | 
               | Obviously we want the code to produce correct results for
               | whatever input we give, and as it stands now, I can't
               | trust LLM output without reviewing first. Still a helpful
               | tool, but ultimately my desire would be to have them be
               | as accurate as a calculator so they can be trusted enough
               | to not need the review step.
               | 
               | Using an LLM and being OK with untrustworthy results,
               | it'd be like clicking the terminal icon on my dock and
               | sometimes it opens terminal, sometimes it might open a
               | browser, or just silently fail because there's no
               | reproducible output for any given input to an LLM. To me
               | that's a problem, output should be reproducible,
               | especially if it's writing code.
        
               | LordDragonfang wrote:
               | A calculator isn't software, it's hardware. Your inputs
               | into a calculator _are_ code.
               | 
               | Your interaction with LLMs is categorically closer to
               | interactions with people than with a calculator. Your
               | inputs into it are language.
               | 
               | Of course the two are different. A calculator is a
               | computer, an LLM is not. Comparing the two is making the
               | same category error which would confuse Mr. Babbage, but
               | in reverse.
               | 
               | ("On two occasions, I have been asked [by members of
               | Parliament], 'Pray, Mr. Babbage, if you put into the
               | machine wrong figures, will the right answers come out?'
               | I am not able to rightly apprehend the kind of confusion
               | of ideas that could provoke such a question.")
        
             | pohuing wrote:
             | It's a tool, not an intelligence, a tool that costs money
             | on every erroneous token. I expect my computer to be more
             | reliable at remembering things than myself, that's one of
             | the primary use cases even. Especially if using it costs
             | money. Of course errors are possible, but rarely do they
             | happen as frequently in any other program I use.
        
             | kortilla wrote:
             | If colleagues lie with the certainty that LLMs do, they
             | would get fired for incompetence.
        
               | scarab92 wrote:
               | I wish that were true, but I've found that certain types
               | of employees do confidently lie as much as llms,
               | especially when answering "do you understand" type
               | questions
        
               | izacus wrote:
               | And we try to PIP and fire those as well, not turn
               | everyone else into them.
        
               | dmd wrote:
               | Or elected to high office.
        
               | ChromaticPanic wrote:
                | Have you worked in an actual workplace? Confidence is
                | king.
        
             | ksec wrote:
              | I don't expect it to be 100% accurate. Software isn't bug
              | free, humans aren't perfect. But maybe 99.99%? At least,
              | given enough time and resources, humans could fact check it
              | ourselves. And precisely because we know we are not
             | perfect, in accounting and court cases we have due
             | diligence.
             | 
             | And it is also not just about the %. It is also about the
             | type of error. Will we reach a point we change our
             | perception and say these are expected non-human error?
             | 
             | Or could we have a specific LLM that only checks for these
             | types of error?
        
             | mdp2021 wrote:
             | Yes we want people "in the game" to be of sound mind. (The
             | matter there is not about being accurate, but of being
             | trustworthy - substance, not appearance.)
             | 
             | And tools in the game, even more so (there's no excuse for
             | the engineered).
        
           | Foreignborn wrote:
            | Try dropping the entire API docs in the context. If it's
            | verbose, I usually pull only a subset of pages.
           | 
           | Usually I'm using a minimum of 200k tokens to start with
           | gemini 2.5.
        
             | nolist_policy wrote:
             | That's more than 222 novel pages:
             | 
              | 200k tokens ~ 1/3 * 200k words ~ 1/300 * 1/3 * 200k pages
              | ~ 222 pages
        
           | pizza wrote:
           | "if it were a fact, it wouldn't be called intelligence" -
           | donald rumsfeld
        
         | codebolt wrote:
         | I've found they do a decent job searching for bugs now as well.
         | Just yesterday I had a bug report on a component/page I wasn't
         | familiar with in our Angular app. I simply described the issue
         | as well as I could to Claude and asked politely for help
         | figuring out the cause. It found the exact issue correctly on
         | the first try and came up with a few different suggestions for
         | how to fix it. The solutions weren't quite what I needed but it
         | still saved me a bunch of time just figuring out the error.
        
           | M4v3R wrote:
           | That's my experience as well. Many bugs involve typos, syntax
           | issues or other small errors that LLMs are very good at
           | catching.
        
         | ChocolateGod wrote:
         | I asked today both Claude and ChatGPT to fix a Grafana Loki
         | query I was trying to build, both hallucinated functions that
          | didn't exist, even when told to use existing functions.
         | 
         | To my surprise, Gemini got it spot on first time.
        
           | fwip wrote:
           | Could be a bit of a "it's always in the last place you look"
           | kind of thing - if Claude or CGPT had gotten it right, you
           | wouldn't have tried Gemini.
        
         | redox99 wrote:
         | Making LLMs know what they don't know is a hard problem. Many
         | attempts at making them refuse to answer what they don't know
         | caused them to refuse to answer things they did in fact know.
        
           | Volundr wrote:
           | > Many attempts at making them refuse to answer what they
           | don't know caused them to refuse to answer things they did in
           | fact know.
           | 
           | Are we sure they know these things as opposed to being able
           | to consistently guess correctly? With LLMs I'm not sure we
           | even have a clear definition of what it means for it to
           | "know" something.
        
             | redox99 wrote:
             | Yes. You could ask for factual information like "Tallest
             | building in X place" and first it would answer it did not
             | know. After pressuring it, it would answer with the correct
             | building and height.
             | 
             | But also things where guessing was desirable. For example
             | with a riddle it would tell you it did not know or there
             | wasn't enough information. After pressuring it to answer
             | anyway it would correctly solve the riddle.
             | 
             | The official llama 2 finetune was pretty bad with this
             | stuff.
        
               | Volundr wrote:
               | > After pressuring it, it would answer with the correct
               | building and height.
               | 
               | And if you bully it enough on something nonsensical it'll
               | give you a wrong answer.
               | 
               | You press it, and it takes a guess even though you told
               | it not to, and gets it right, then you go "see it knew!".
               | There's no database hanging out in
               | ChatGPT/Claude/Gemini's weights with a list of cities and
               | the tallest buildings. There's a whole bunch of opaque
               | stats derived from the content it's been trained on that
               | means that most of the time it'll come up with the same
               | guess. But there's no difference in process between that
               | highly consistent response to you asking the tallest
               | building in New York and the one where it hallucinates a
               | Python method that doesn't exist, or suggests glue to
               | keep the cheese on your pizza. It's all the same process
               | to the LLM.
        
             | ajross wrote:
             | > Are we sure they know these things as opposed to being
             | able to consistently guess correctly?
             | 
             | What is the practical difference you're imagining between
             | "consistently correct guess" and "knowledge"?
             | 
             | LLMs aren't databases. We have databases. LLMs are
             | probabilistic inference engines. _All they do_ is guess,
             | essentially. The discussion here is about how to get the
             | guess to  "check itself" with a firmer idea of "truth". And
             | it turns out that's hard because it requires that the
             | guessing engine know that something needs to be checked in
             | the first place.
        
               | mynameisvlad wrote:
               | Simple, and even simpler from your own example.
               | 
               | Knowledge has an objective correctness. We know that
               | there is a "right" and "wrong" answer and we know what a
               | "right" answer is. "Consistently correct guesses", based
               | on the name itself, is not reliable enough to actually be
               | trusted. There's absolutely no guarantee that the next
               | "consistently correct guess" is knowledge or a
               | hallucination.
        
               | ajross wrote:
               | This is a circular semantic argument. You're saying
               | knowledge is knowledge because it's correct, where
               | guessing is guessing because it's a guess. But "is it
               | correct?" is precisely the question you're asking the
               | poor LLM to answer in the first place. It's not helpful
               | to just demand a computation device work the way you
               | want, you need to actually make it work.
               | 
               | Also, too, there are whole subfields of philosophy that
               | make your statement here kinda laughably naive. Suffice
               | it to say that, no, knowledge as rigorously understood
               | does _not_ have  "an objective correctness".
        
               | mynameisvlad wrote:
               | I mean, it clearly does based on your comments showing a
               | need for a correctness check to disambiguate between made
               | up "hallucinations" and actual "knowledge" (together, a
               | "consistently correct guess").
               | 
               | The fact that you are humanizing an LLM is honestly just
               | plain weird. It does not have feelings. It doesn't care
               | that it has to answer "is it correct?" and saying _poor_
               | LLM is just trying to tug on heartstrings to make your
               | point.
        
               | ajross wrote:
               | FWIW "asking the poor <system> to do <requirement>" is an
               | extremely common idiom. It's used as a metaphor for an
               | inappropriate or unachievable design requirement. Nothing
               | to do with LLMs. I work on microcontrollers for a living.
        
               | Volundr wrote:
               | > You're saying knowledge is knowledge because it's
               | correct, where guessing is guessing because it's a guess.
               | 
               | Knowledge is knowledge because the knower knows it to be
               | correct. I know I'm typing this into my phone, because
               | it's right here in my hand. I'm guessing you typed your
               | reply into some electronic device. I'm guessing this is
               | true for all your comments. Am I 100% accurate? You'll
               | have to answer that for me. I don't know it to be true,
               | it's a highly informed guess.
               | 
               | Being wrong sometimes is not what makes a guess a guess.
               | It's the difference between pulling something from your
               | memory banks, be they biological or mechanical, vs
               | inferring it from some combination of your knowledge
               | (what's in those memory banks), statistics, intuition,
               | and whatever other fairy dust you sprinkle on.
        
               | fwip wrote:
               | So, if that were so, then an LLM possesses no knowledge
               | whatsoever, and cannot ever be trusted. Is that the line
               | of thought you are drawing?
        
               | Volundr wrote:
               | > What is the practical difference you're imagining
               | between "consistently correct guess" and "knowledge"?
               | 
               | Knowing it's correct. You've just instructed it not to
               | guess, remember? With practice people can get really good
               | at guessing all sorts of things.
               | 
               | I think people have a serious misunderstanding about how
               | these things work. They don't have their training set
               | sitting around for reference. They are usually guessing.
               | Most of the time with enough consistency that it seems
               | like they "know". Then when they get it wrong we call it
               | "hallucinations". But instructing them not to guess means
               | suddenly they can't answer much. There's no guessing vs.
               | not guessing with an LLM; it's all the same statistical
               | process, and the difference is just whether it gives the
               | right answer or not.
        
           | mountainriver wrote:
           | https://github.com/IINemo/lm-polygraph is the best work in
           | this space
        
           | rdtsc wrote:
           | > Making LLMs know what they don't know is a hard problem.
           | Many attempts at making them refuse to answer what they don't
           | know caused them to refuse to answer things they did in fact
           | know.
           | 
           | They are the perfect "fake it till you make it" example
           | cranked up to 11. They'll bullshit you, but will do it
           | confidently and with proper grammar.
           | 
           | > Many attempts at making them refuse to answer what they
           | don't know caused them to refuse to answer things they did in
           | fact know.
           | 
           | I can see in some contexts that being desirable if it can be
           | a parameter that can be tweaked. I guess it's not that easy,
           | or we'd already have it.
        
           | bezier-curve wrote:
           | The best way around this is to dump documentation for the
           | APIs you need them to know about into their context window.
        
         | pzo wrote:
         | I feel your pain. Cursor has a docs feature, but many times
         | when I pointed it at @docs and selected a recently indexed
         | one, it still didn't get it. I still have to try the Context7
         | MCP, which looks promising:
         | 
         | https://github.com/upstash/context7
        
         | doug_durham wrote:
         | If they never get good at abstraction or architecture they will
         | still provide a tremendous amount of value. I have them do the
         | parts of my job that I don't like. I like doing abstraction and
         | architecture.
        
           | mynameisvlad wrote:
           | Sure, but that's not the problem people have with them nor
           | the general criticism. It's that people without the knowledge
           | to do abstraction and architecture don't realize the
           | importance of these things and pretend that "vibe coding" is
           | a reasonable alternative to a well-thought-out project.
        
             | sanderjd wrote:
             | The way I see this is that it's just another skill
             | differentiator that you can take advantage of if you can
             | get it right.
             | 
             | That is, if it's true that abstraction and architecture are
             | useful for a given product, then people who know how to do
             | those things will succeed in creating that product, and
             | those who don't will fail. I think this is true for
             | essentially all production software, but a lot of software
             | never reaches production.
             | 
             | Transitioning or entirely recreating "vibecoded" proofs of
             | concept to production software is another skill that will
             | be valuable.
             | 
             | Having a good sense for when to do that transition, or when
             | to start building production software from the start, and
             | especially the ability to influence decision makers to
             | agree with you, is another valuable skill.
             | 
             | I do worry about what the careers of entry level people
             | will look like. It isn't obvious to me how they'll
             | naturally develop any of these skills.
        
               | mynameisvlad wrote:
               | > "vibecoded" proofs of concept
               | 
               | The fact that you called it out as a PoC is already many
               | bars above what most vibe coders are doing. Which is
               | considering a barely functioning web app as proof that
               | vibe coding is a viable solution for coding in general.
               | 
               | > I do worry about what the careers of entry level people
               | will look like. It isn't obvious to me how they'll
               | naturally develop any of these skills.
               | 
               | Exactly. There isn't really a path forward from vibe
               | coding to anything productizable without actual, deep CS
               | knowledge. And LLMs are not providing that.
        
               | sanderjd wrote:
               | Yeah I think we largely agree. But I do know people,
               | mostly experienced product managers, who are excited
               | about "vibecoding" expressly as a prototyping / demo
               | creation tool, which can be useful in conjunction with
               | people who know how to turn the prototypes into real
               | software.
               | 
               | I'm sure lots of people aren't seeing it this way, but
               | the point I was trying to make about this being a skill
               | differentiator is that I think understanding the
               | advantages, limitations, and tradeoffs, and keeping that
               | understanding up to date as capabilities expand, is
               | already a valuable skillset, and will continue to be.
        
             | Karrot_Kream wrote:
             | We can rewind the clock 10 years and I can substitute "vibe
             | coding" for VBA/Excel macros and we'd get a common type of
             | post from back then.
             | 
             | There's always been a demand for programming by non
             | technical stakeholders that they try and solve without
             | bringing on real programmers. No matter the tool, I think
             | the problem is evergreen.
        
         | Tainnor wrote:
         | I definitely get more use out of Gemini Pro than other models
         | I've tried, but it's still very prone to bullshitting.
         | 
         | I asked it a complicated question about the Scala ZIO framework
         | that involved subtyping, type inference, etc. - something that
         | would definitely be hard to figure out just from reading the
         | docs. The first answer it gave me was very detailed, very
         | convincing and very wrong. Thankfully I noticed it myself and
         | was able to re-prompt it and I got an answer that is probably
         | right. So it was useful in the end, but only because I realised
         | that the first answer was nonsense.
        
         | gxs wrote:
         | Huh? Have you ever just told it, that API doesn't exist, find
         | another solution?
         | 
         | Never seen it fumble that around
         | 
         | Swear people act like humans themselves don't ever need to be
         | asked for clarification
        
         | 0x457 wrote:
         | I've noticed that models that can search the internet do it a
         | lot less, because I guess they can look up documentation? My
         | annoyance now is that they don't take versions into
         | consideration.
        
         | tough wrote:
         | You should give it docs for each of your base dependencies in
         | an MCP/tool, whatever, so it can just consult them.
         | 
         | Internet access also helps.
         | 
         | Also, having markdown files with the stack etc. and any -rules-
        
         | siscia wrote:
         | This problem has been solved by LSP (Language Server
         | Protocol); all we need is a small server behind MCP that can
         | communicate LSP information back to the LLM, and to get the
         | LLM to use it by adding something like "check your API usage
         | with the LSP" to the prompt.
         | 
         | The unfortunate state of open source funding makes building
         | such a simple tool a losing adventure, unfortunately.
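         | 
         | A minimal sketch of the MCP side using the Python MCP SDK
         | (FastMCP); the LSP query itself is only a stub here, and the
         | tool name is made up:
         | 
         |     # pip install mcp
         |     from mcp.server.fastmcp import FastMCP
         | 
         |     mcp = FastMCP("lsp-bridge")
         | 
         |     def lsp_hover(path: str, line: int, col: int) -> str:
         |         # placeholder: a real bridge would forward this to a
         |         # language server over JSON-RPC (textDocument/hover)
         |         return "signature lookup not wired up yet"
         | 
         |     @mcp.tool()
         |     def check_api_usage(path: str, line: int, col: int) -> str:
         |         """Hover/signature info for the symbol at a position,
         |         so the LLM can verify its API usage."""
         |         return lsp_hover(path, line, col)
         | 
         |     if __name__ == "__main__":
         |         mcp.run()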
        
           | satvikpendem wrote:
           | This already happens in agent modes in IDEs like Cursor or
           | VSCode with Copilot, it can check for errors with the LSP.
        
         | jug wrote:
         | I've seen benchmarks on hallucinations, and OpenAI has
         | typically performed worse than Google and Anthropic models.
         | Sometimes significantly so. But it doesn't seem like they have
         | cared much. I've suspected that LLM performance is correlated
         | with risking hallucinations? That is, if they're bolder, this
         | can be beneficial? Which helps in other performance
         | benchmarks. But of course at the risk of hallucinating more...
        
           | mountainriver wrote:
           | The hallucinations are a result of RLVR. We reward the model
           | for an answer and then force it to reason about how to get
           | there when the base model may not have that information.
        
             | mdp2021 wrote:
             | > _The hallucinations are a result of RLVR_
             | 
             | Well, let us then reward them for producing output that is
             | consistent with selected, database-accessed documentation,
             | and massacre them for output they cannot justify - like we
             | do with humans.
        
         | johnisgood wrote:
         | > hallucinate APIs
         | 
         | Tell me about it. Thankfully I have not experienced it as much
         | with Claude as I did with GPT. It can get quite annoying. GPT
         | kept telling me to use this and that and none of them were real
         | projects.
        
         | impulser_ wrote:
         | Use few-shot learning. Build a simple prompt with basic
         | examples of how to use the API and it will do significantly
         | better.
         | 
         | LLMs just guess, so you have to give them a cheatsheet to
         | help them guess closer to what you want.
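         | 
         | For instance, something as small as the following pasted
         | ahead of the task already helps (the foo.* calls are invented
         | here, purely to show the shape of a few-shot prompt):
         | 
         |     FEW_SHOT = """You call the Foo API like this:
         | 
         |     # list items in a project
         |     foo.items.list(project="demo", page_size=50)
         | 
         |     # create an item
         |     foo.items.create(project="demo", body={"name": "x"})
         | 
         |     Task: write code that renames every item in project
         |     "demo" to lowercase."""
         | 
         |     # send FEW_SHOT as the prompt to whichever model you use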
        
           | rcpt wrote:
           | I'm using repomix for this
        
           | M4v3R wrote:
           | At this point the time it takes to teach the model might be
           | more than you save from using it for interacting with that
           | API.
        
         | Jordan-117 wrote:
         | I recently needed to recommend some IAM permissions for an
         | assistant on a hobby project; not complete access but just
         | enough to do what was required. Was rusty with the console and
         | didn't have direct access to it at the time, but figured it was
         | a solid use case for LLMs since AWS is so ubiquitous and well-
         | documented. I actually queried 4o, 3.7 Sonnet, and Gemini 2.5
         | for recommendations, stripped the list of duplicates, then
         | passed the result to Gemini to vet and format as JSON. The
         | result was perfectly formatted... and still contained a bunch
         | of non-existent permissions. My first time being burned by a
         | hallucination IRL, but just goes to show that even the latest
         | models working in concert on a very well-defined problem space
         | can screw up.
        
           | dotancohen wrote:
           | AWS docs have (had) an embedded AI model that would do this
           | perfectly. I suppose it had better training data, and the
           | actual spec as a RAG.
        
             | djhn wrote:
             | Both AWS and Azure docs' built in models have been
             | absolutely useless.
        
           | darepublic wrote:
           | Listen I don't blame any mortal being for not grokking the
           | AWS and Google docs. They are a twisting labyrinth of
           | pointers to pointers, some of them deprecated though still
           | recommended by Google itself.
        
           | perching_aix wrote:
           | Sounds like a vague requirement, so I'd just generally point
           | you towards the AWS managed policies summary [0] instead.
           | Particularly the PowerUserAccess policy sounds fitting here
           | [1] if the description for it doesn't raise any immediate
           | flags. Alternatively, you could browse through the job
           | function oriented policies [2] they have and see if you find
           | a better fit. Can just click it together instead of bothering
           | with the JSON. Though it sounds like you're past this problem
           | by now.
           | 
           | [0] https://docs.aws.amazon.com/IAM/latest/UserGuide/access_p
           | oli...
           | 
           | [1] https://docs.aws.amazon.com/aws-managed-
           | policy/latest/refere...
           | 
           | [2] https://docs.aws.amazon.com/IAM/latest/UserGuide/access_p
           | oli...
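           | 
           | If the assistant is an IAM user, attaching the managed
           | policy is a single call with boto3 (the user name below is
           | a placeholder):
           | 
           |     import boto3
           | 
           |     iam = boto3.client("iam")
           |     iam.attach_user_policy(
           |         UserName="hobby-project-assistant",
           |         PolicyArn="arn:aws:iam::aws:policy/PowerUserAccess",
           |     )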
        
         | satvikpendem wrote:
         | If you use Cursor, you can use @Docs to let it index the
         | documentation for the libraries and languages you use, so no
         | hallucination happens.
        
           | Rudybega wrote:
           | The context7 mcp works similarly. It allows you to search a
           | massive constantly updated database of relevant documentation
           | for thousands of projects.
        
         | thr0waway39290 wrote:
         | Replacing stackoverflow is definitely helpful, but the best use
         | case for me is how much it helps in high-level architecture and
         | planning before starting a project.
        
         | yousif_123123 wrote:
         | The opposite problem is also true. I was using it to edit code
         | I had that was calling the new openai image API, which is
         | slightly different from the dalle API. But Gemini was
         | consistently "fixing" the OpenAI call even when I explained
         | clearly not to do that since I'm using a new API design etc.
         | Claude wasn't having that issue.
         | 
         | The models are very impressive. But issues like these still
         | make me feel they are still more pattern matching (although
         | there's also some magic, don't get me wrong) but not fully
         | reasoning over everything correctly like you'd expect of a
         | typical human reasoner.
        
           | toomuchtodo wrote:
           | It seems like the fix is straightforward (check the output
           | against a machine readable spec before providing it to the
           | user), but perhaps I am a rube. This is no different than me
           | clicking through a search result to the underlying page to
           | verify the veracity of the search result surfaced.
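           | 
           | As a sketch, with a machine-readable spec in hand the check
           | can be as blunt as rejecting parameters the spec doesn't
           | define (the parameter names here are made up):
           | 
           |     # toy example: flag generated params that aren't in the
           |     # machine-readable spec for an endpoint
           |     SPEC_PARAMS = {"model", "prompt", "size", "n"}
           | 
           |     def unknown_params(generated: dict) -> set:
           |         return set(generated) - SPEC_PARAMS
           | 
           |     unknown_params({"model": "x", "quality": "hd"})
           |     # -> {"quality"}  (flag before showing it to the user)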
        
             | disgruntledphd2 wrote:
             | Why coding agents et al don't make use of the AST through
             | LSP is a question I've been asking myself since the first
             | release of GitHub copilot.
             | 
             | I assume that it's trickier than it seems as it hasn't
             | happened yet.
        
               | celeritascelery wrote:
               | What good do you think that would do?
        
           | disgruntledphd2 wrote:
           | They are definitely pattern matching. Like, that's how we
           | train them, and no matter how many layers of post training
           | you add, you won't get too far from next token prediction.
           | 
           | And that's fine and useful.
        
             | mdp2021 wrote:
             | > _fine and useful_
             | 
             | And crippled, incomplete, and deceiving, dangerous.
        
         | mannycalavera42 wrote:
         | same, I asked a simple question about javascript fetch api and
         | it started talking about the workspace api. When I asked about
         | that workspace api it replied it was the google workspace API
         | ¯\_(ツ)_/¯
        
         | mbesto wrote:
         | To date, LLMs can't replace the human element of:
         | 
         | - Determining what features to make for users
         | 
         | - Forecasting out a roadmap that are aligned to business goals
         | 
         | - Translating and prioritizing all of these to a developer
         | (regardless of whether these developers are agentic or human)
         | 
         | Coincidentally, these are the areas that frequently are the
         | largest contributors to software businesses' successes... not
         | whether you use NextJs with a Go and Elixir backend against a
         | multi-geo redundant, multi-sharded CockroachDB database, or
         | that your code is clean/elegant.
        
           | nearbuy wrote:
           | What does it say when you ask it to?
        
           | dist-epoch wrote:
           | Maybe at elite companies.
           | 
           | At half of the companies you can randomly pick those three
           | things and probably improve the situation. Using an AI would
           | be a massive improvement.
        
         | jstummbillig wrote:
         | > no amount of prompting will get current models to approach
         | abstraction and architecture the way a person does
         | 
         | I find this sentiment increasingly worrisome. It's entirely
         | clear that every last human will be beaten on code design in
         | the upcoming years (I am not going to argue if it's 1 or 5
         | years away, who cares?)
         | 
         | I wish people would just stop holding on to what amounts to
         | nothing, and think and talk more about what can be done in a
         | new world. We need good ideas and I think this could be a place
         | to advance them.
        
           | jjice wrote:
           | I'm confused by your comment. It seems like you didn't really
           | provide a retort to the parent's comment about bad
           | architecture and abstraction from LLMs.
           | 
           | FWIW, I think you're probably right that we need to adapt,
           | but there was no explanation as to _why_ you believe that
           | that's the case.
        
             | TuringNYC wrote:
             | I think they are pointing out that the advantage humans
             | have has been chipped away little by little and computers
             | winning at coding is inevitable on some timeline. They are
             | also suggesting that perhaps the GP is being defensive.
        
               | dml2135 wrote:
               | Why is it inevitable? Progress towards a goal in the past
               | does not guarantee progress towards that goal in the
               | future. There are plenty of examples of technology moving
               | forward, and then hitting a wall.
        
               | TuringNYC wrote:
               | I agree with you it isn't guaranteed to be inevitable, and
               | also agree there have been plenty of journeys which were
               | on a trajectory only to fall off.
               | 
               | That said, IMHO it is inevitable. My personal (dismal)
               | view is that businesses see engineering as a huge cost
               | center to be broken up and it will play out just like
               | manufacturing -- decimated without regard to the human
               | cost. The profit motive and cost savings are just too
               | great to not try. It is a very specific line item so
               | cost/savings attribution is visible and already tracked.
               | Finally, a good % of the industry has been staffed up
               | with under-trained workers (e.g., express bootcamps) who
               | aren't working on abstraction, etc -- they are doing basic
               | CRUD work.
        
               | warkdarrior wrote:
               | > businesses see engineering as a huge cost center to be
               | [...] decimated without regard to the human cost
               | 
               | Most cost centers in the past were decimated in order to
               | make progress: from horse-drawn carriages to cars and
               | trucks, from mining pickaxes to mining machines, from
               | laundry at the river to clothes washing machines, etc. Is
               | engineering a particularly unique endeavor that needs to
               | be saved from automation?
        
           | saurik wrote:
           | I mean, didn't you just admit you are wrong? If we are
           | talking 1-5 years out, that's not "current models".
        
             | jstummbillig wrote:
             | Imagine sitting in a car, that is fast approaching a cliff,
             | with no brakes, while the driver talks about how they have
             | not been in any serious car accident so far.
             | 
             | Technically correct. And yet, you would probably be at
             | least a little worried about that cliff and would rather
             | talk about that.
        
           | mattgreenrocks wrote:
           | I'm always impressed by the ability of the comment section to
           | come up with more reasons why decent design and architecture
           | of source code just can't happen:
           | 
           | * "it's too hard!"
           | 
           | * "my coworkers will just ruin it"
           | 
           | * "startups need to pursue PMF, not architecture"
           | 
           | * "good design doesn't get you promoted"
           | 
           | And now we have "AI will do it better soon."
           | 
           | None of those are entirely wrong. They're not entirely
           | correct, either.
        
             | dullcrisp wrote:
             | It's always so aggressive too. What fools we are for trying
             | to write maintainable code when it's so obviously
             | impossible.
        
           | DanHulton wrote:
           | > It's entirely clear that every last human will be beaten on
           | code design in the upcoming years
           | 
           | Citation needed. In fact, I think this pretty clearly hits
           | the "extraordinary claims require extraordinary evidence"
           | bar.
        
             | coffeemug wrote:
             | AlphaGo.
        
               | giovannibonetti wrote:
               | A board game has a much narrower scope than programming
               | in general.
        
               | cft wrote:
               | Though this was in 2016. 9 years have passed.
        
               | astrange wrote:
               | LLMs and AlphaGo don't work at all similarly, since LLMs
               | don't use search.
        
             | kaliqt wrote:
             | Trends would dictate that this will keep scaling and
             | surpass each goalpost year by year.
        
             | sweezyjeezy wrote:
             | I would argue that what LLMs are capable of doing right now
             | is already pretty extraordinary, and would fulfil your
             | extraordinary evidence request. To turn it on its head -
             | given the rather astonishing success of the recent LLM
             | training approaches, what evidence do you have that these
             | models are going to plateau short of your own abilities?
        
               | sigmaisaletter wrote:
               | What they do is extraordinary, but it's not just a claim -
               | they actually do it, and their doing so is the evidence.
               | 
               | Here someone just claimed that it is "entirely clear"
               | LLMs will become super-human, without any evidence.
               | 
               | https://en.wikipedia.org/wiki/Extraordinary_claims_requir
               | e_e...
        
               | sweezyjeezy wrote:
               | Again - I'd argue that the extraordinary success of LLMs,
               | in a relatively short amount of time, using a fairly
               | unsophisticated training approach, is strong evidence
               | that coding models are going to get a lot better than
               | they are right now. Will it definitely surpass every
               | human? I don't know, but I wouldn't say we're lacking
               | extraordinary evidence for that claim either.
               | 
               | The way you've framed it seems like the only evidence you
               | will accept is after it's actually happened.
        
               | sigmaisaletter wrote:
               | Well, predicting the future is always hard. But if
               | someone claims some extraordinary future event is going
               | to happen, you at least ask for their reasons for
               | claiming so, don't you?
               | 
               | In my mind, at this point we either need (a) some
               | previously "hidden" super-massive source of training
               | data, or (b) another architectural breakthrough. Without
               | either, this is a game of optimization, and the scaling
               | curves are going to plateau really fast.
        
               | sweezyjeezy wrote:
               | A couple of comments there -
               | 
               | a) it hasn't even been a year since the last big
               | breakthrough, the reasoning models like o3 only came out
               | in September. I'd wait a second before assuming the low-
               | hanging fruit is done.
               | 
               | b) I think coding is a really good environment for agents
               | / reinforcement learning. Rather than requiring a
               | continual supply of new training data, we give the model
               | coding tasks to execute (writing / maintaining /
               | modifying) and then test its code for correctness. We
               | could for example take the entire history of a code-base
               | and just give the model its changing unit + integration
               | tests to implement. My hunch (with no extraordinary
               | evidence) is that this is how coding agents start to nail
               | some of the higher-level abilities.
        
           | davidsainez wrote:
           | I use LLMs for coding every day. There have been significant
           | improvements over the years but mostly across a single
           | dimension: mapping human language to code. This capability is
           | robust, but you still have to know how to manage context to
           | keep them focused. I still have to direct them to consider
           | e.g. performance or architecture considerations.
           | 
           | I'm not convinced that they can reason effectively (see the
           | ARC-AGI-2 benchmarks). Doesn't mean that they are not useful,
           | but they have their limitations. I suspect we still need to
           | discover tech distinct from LLMs to get closer to what a
           | human brain does.
        
           | acedTrex wrote:
           | > It's entirely clear that every last human will be beaten on
           | code design in the upcoming years
           | 
           | In what world is this statement remotely true.
        
             | dullcrisp wrote:
             | In the world where idle speculation can be passed off as
             | established future facts, i.e., this one I guess.
        
             | 1024core wrote:
             | Proof by negation, I guess?
             | 
             | If someone were to claim: no computer will ever be able to
             | beat humans in code design, would you agree with that? If
             | the answer is "no", then there's your proof.
        
           | Workaccount2 wrote:
           | Software will change to accommodate LLMs, if for no other
           | reason than we are on the cusp of everyone being a junior
           | level programmer. What does software written for LLMs to
           | middleman look like?
           | 
           | I think there is a total seismic change in software that is
           | about to go down, similar to something like going from gas
           | lamps to electric. Software doesn't need to be the way it is
           | now anymore, since we have just about solved human language
           | to computer interface translation. I don't want to fuss with
           | formatting a Word document anymore; I would rather just tell
           | an LLM and let it modify the program memory to implement
           | what I want.
        
           | epolanski wrote:
           | > no amount of prompting will get current models to approach
           | abstraction and architecture the way a person does
           | 
           | Which person is it? Because 90% of the people in our trade
           | are bad, like, real bad.
           | 
           | I get that people on HN are in that elitist niche of those
           | who care more, focus on career more, etc so they don't even
           | realize the existence of armies of low quality body rental
           | consultancies and small shops out there working on Magento or
           | Liferay or even worse crap.
        
           | bayindirh wrote:
           | > It's entirely clear that every last human will be beaten on
           | code design in the upcoming years (I am not going to argue if
           | it's 1 or 5 years away, who cares?)
           | 
           | No-code and AI-assisted programming have been said to be
           | just around the corner since 2000. We have just arrived at a
           | point where models remix what others have typed on their
           | keyboards, and yet somebody _still_ argues that humans will
           | be left in the dust in the near term.
           | 
           | No machine, incl. humans, can create something more complex
           | than itself. This is the rule of abstraction. As you go
           | higher level, you lose expressiveness. Yes, you express more
           | with less, yet you can express less _in total_. You're
           | reducing the set's symbol size (element count) as you go
           | higher by clumping symbols together and assigning more
           | complex meanings to them.
           | 
           | Yet, being able to describe a larger set with more elements
           | while keeping all elements addressable with fewer possible
           | symbols doesn't sound plausible to me.
           | 
           | So, as others said: citation needed. Extraordinary claims
           | need extraordinary evidence. No, asking AI to create a
           | premium mobile photo app and getting Halide's design as an
           | output doesn't count. That's training data leakage.
        
           | joshjob42 wrote:
           | I mean, if you draw the scaling curves out and believe them,
           | then sometime in the next 3-10 years, plausibly shorter, AIs
           | will be able to achieve best-case human performance in
           | everything able to be done with a computer and do it at
           | 10-1000x less cost than a human, and shortly thereafter
           | robots will be able to do something similar (though with a
           | smaller delta in cost) for physical labor, and then shortly
           | after that we get atomically precise manufacturing and post-
           | scarcity. So the amount of stuff that amounts to nothing is
           | plausibly every field of endeavor that isn't slightly
           | advancing or delaying AI progress itself.
        
             | sigmaisaletter wrote:
             | If the scaling continues. We just don't know.
             | 
             | It is kinda a meme at this point, that there is no more
             | "publicly available"... cough... training data. And while
             | there have been massive breakthroughs in architecture, a
             | lot of the progress of the last couple years has been ever
             | more training for ever larger models.
             | 
             | So, at this point we either need (a) some previously
             | "hidden" super-massive source of training data, or (b)
             | another architectural breakthrough. Without either, this is
             | a game of optimization, and the scaling curves are going to
             | plateau really fast.
        
           | bdangubic wrote:
           | _It 's entirely clear that every last human will be beaten on
           | code design in the upcoming years (I am not going to argue if
           | it's 1 or 5 years away, who cares?)_
           | 
           | Our entire industry (after all these years) does not have
           | even a remotely sane measure or definition of what good code
           | design is. Hence, this statement is dead on arrival, as you are
           | claiming something that cannot be either proven or disproven
           | by anyone.
        
         | froh wrote:
         | searching and ranking existing fragments and recombining them
         | within well-known paths is one thing; exploratively combining
         | existing fragments into completely novel solutions quickly runs
         | into combinatorial explosion.
         | 
         | so it's a great tool in the hands of a creative architect, but
         | it is not one in and by itself and I don't see yet how it can
         | be.
         | 
         | my pet theory is that the human brain can't understand and
         | formalize its creativity because you need a higher order logic
         | to fully capture some other logic. I've been contested that the
         | second Godel incompleteness theorem "can't be applied like this
         | to the brain" but I stubbornly insist yes, the brain implements
         | _some_ formal system and it can't understand how that system
         | works. tongue in cheek, somewhat, maybe.
         | 
         | but back to earth I agree llms are a great tool for a creative
         | human mind.
        
           | dist-epoch wrote:
           | > Demystifying Godel's Theorem: What It Actually Says
           | 
           | > If you think his theorem limits human knowledge, think
           | again
           | 
           | https://www.youtube.com/watch?v=OH-ybecvuEo
        
             | froh wrote:
             | thanks for the pointer.
             | 
             | first, with Neil DeGrasse Tyson I feel in fairly ok company
             | with my little pet peeve fallacy ;-)
             | 
             | yah as I said, I both get it and don't ;-)
             | 
             | And then the video loses me when it says statements about
             | the brain "being a formal method" can't be made "because"
             | the finite brain can't hold infinity.
             | 
             | that's beyond me. although obviously the brain can't
             | enumerate infinite possibilities, we're still fairly well
             | capable of formal thinking, aren't we?
             | 
             | And many lovely formal systems nicely fit on fairly finite
             | paper. And formal proofs can be run on finite computers.
             | 
             | So somehow the logic in the video is beyond me.
             | 
             | My humble point is this: if we build "intelligence" as a
             | formal system, like some silicon running some fancy pants
             | LLM what have you, and we want rigor in its construction,
             | i.e. if we want to be able to tell "this is how it works",
             | then we need to use a subset of our brain that's capable of
             | formal and consistent thinking. And my claim is that _that
             | subsystem_ can't capture "itself". So we have to use "more"
             | of our brain than that subsystem. so either the "AI" that
             | we understand is "less" than what we need and use to
             | understand it. or we can't understand it.
             | 
             | I fully get our brain is capable of more, and this "more"
             | is obviously capable of very inconsistent outputs, HAL 9000
             | had that problem, too ;-)
             | 
             | I'm an old woman. it's late at night.
             | 
             | When I sat through Godel back in the early 1990s in CS and
             | then in contrast listened to the enthusiastic AI lectures
             | it didn't sit right with me. Maybe one of the AI Prof's
             | made that tactical mistake to call our brain "wet
             | biological hardware" in contrast to "dry silicon hardware".
             | but I can't shake of that analogy ;-) I hope I'm wrong :-)
             | "real" AI that we can trust because we can reason about
             | its inner workings will be fun :-)
        
           | breuleux wrote:
           | > I've been contested that the second Godel incompleteness
           | theorem "can't be applied like this to the brain" but I
           | stubbornly insist yes, the brain implements _some_ formal
           | system and it can't understand how that system works
           | 
           | I would argue that the second incompleteness theorem doesn't
           | have much relevance to the human brain, because it is trying
           | to prove a falsehood. The brain is blatantly _not_ a
           | consistent system. It is, however, paraconsistent: we are
           | perfectly capable of managing a set of inconsistent premises
           | and extracting useful insight from them. That's a good
           | thing.
           | 
           | It's also true that we don't understand how our own brain
           | works, of course.
        
         | jppittma wrote:
         | I've had great success by asking it to do project design first,
         | compose the design into an artifact, and then asking it to
         | consult the design artifact as it writes code.
        
           | epaga wrote:
           | This is a great idea - do you have a more detailed overview
           | of this approach and/or an example? What types of things do
           | you tell it to put into the "artefact"?
        
         | pdntspa wrote:
         | I don't know about that, my own adventures with Gemini Pro 2.5
         | in Roo Code have it outputting code in a style that is very
         | close to my own
         | 
         | While far from perfect for large projects, controlling the
         | scope of individual requests (with orchestrator/boomerang mode,
         | for example) seems to do wonders
         | 
         | Given the sheer, uh, variety of code I see day to day in an
         | enterprise setting, maybe the problem isn't with Gemini?
        
         | abletonlive wrote:
         | I feel like there are two realities right now where half the
         | people say LLM doesn't do anything well and there is another
         | half that's just using LLM to the max. Can everybody preface
         | what stack they are using or what exactly they are doing so we
         | can better determine why it's not working for you? Maybe even
         | include what your expectations are? Maybe even tell us what
         | models you're using? How are you prompting the models exactly?
         | 
         | I find for 90% of the things I'm doing an LLM removes 90% of
         | the starting friction and lets me get to the part that I'm actually
         | interested in. Of course I also develop professionally in a
         | python stack and LLMs are 1 shotting a ton of stuff. My work is
         | standard data pipelines and web apps.
         | 
         | I'm a tech lead at a faang-adjacent company w/ 11YOE, and the
         | systems I work with are directly responsible for about half
         | a billion dollars a year in transactions, and growing. You
         | could argue
         | maybe my standards are lower than yours but I think if I was
         | making deadly mistakes the company would have been on my ass by
         | now or my peers would have caught them.
         | 
         | Everybody that I work with is getting valuable output from
         | LLMs. We are using all the latest openAI models and have a
         | business relationship with openAI. I don't think I'm even that
         | good at prompting and mostly rely on "vibes". Half of the time
         | I'm pointing the model to an example and telling it "in the
         | style of X do X for me".
         | 
         | I feel like comments like these almost seem gaslight-y or maybe
         | there's just a major expectation mismatch between people. Are
         | you expecting LLMs to just do exactly what you say and your
         | entire job is to sit back and prompt the LLM? Maybe I'm just
         | used to shit code, but I've looked at many code bases and there is a
         | huge variance in quality and the average is pretty poor. The
         | average code that AI pumps out is much better.
        
           | oparin10 wrote:
           | I've had the opposite experience. Despite trying various
           | prompts and models, I'm still searching for that mythical 10x
           | productivity boost others claim.
           | 
           | I use it mostly for Golang and Rust, I work building cloud
           | infrastructure automation tools.
           | 
           | I'll try to give some examples, they may seem overly specific
           | but they're the first things that popped into my head when
           | thinking about the subject.
           | 
           | Personally, I found that LLMs consistently struggle with
           | dependency injection patterns. They'll generate tightly
           | coupled services that directly instantiate dependencies
           | rather than accepting interfaces, making testing nearly
           | impossible.
           | 
           | If I ask them to generate code and also their respective unit
           | tests, they'll often just create a bunch of mocks or start
           | importing mock libraries to compensate for their faulty
           | implementation, rather than fixing the underlying
           | architectural issues.
           | 
           | They consistently fail to understand architecture patterns,
           | generating code where infrastructure concerns bleed into
           | domain logic. When corrected, they'll make surface level
           | changes while missing the fundamental design principle of
           | accepting interfaces rather than concrete implementations,
           | even when explicitly instructed that it should move things
           | like side-effects to the application edges.
           | 
           | Despite tailoring prompts for different models based on
           | guides and personal experience, I often spend 10+ minutes
           | correcting the LLM's output when I could have written the
           | functionality myself in half the time.
           | 
           | No, I'm not expecting LLMs to replace my job. I'm expecting
           | them to produce code that follows fundamental design
           | principles without requiring extensive rewriting. There's a
           | vast middle ground between "LLMs do nothing well" and the
           | productivity revolution being claimed.
           | 
           | That being said, I'm glad it's working out so well for you, I
           | really wish I had the same experience.
        
             | abletonlive wrote:
             | > I use it mostly for Golang and Rust
             | 
             | I'm starting to suspect this is the issue. Neither of these
             | languages is in the top 5, so there is probably less to
             | train on. It'd be interesting to see if this improves over
             | time or if the gap between the languages becomes even more
             | intense as it becomes favorable to use a
             | language simply because LLMs are so much better at it.
             | 
             | There are a lot of interesting discussions to be had here:
             | 
             | - if the efficiency gains are real and llms don't improve
             | in lesser used languages, one should expect that we might
             | observe that companies that chose to use obscure languages
             | and tech stacks die out as they become a lot less
             | competitive against stacks that are more compatible with
             | llms
             | 
             | - if the efficiency gains are real this might
             | disincentivize new language adoption and creation unless
             | the folks training models somehow address this
             | 
             | - languages like python with higher output acceptance rates
             | are probably going to become even more compatible with llms
             | at a faster rate if we extrapolate that positive
             | reinforcement is probably more valuable than negative
             | reinforcement for llms
        
               | oparin10 wrote:
               | Yes, I agree, that's likely a big factor. I've had a
               | noticeably better LLM design experience using widely
               | adopted tech like TypeScript/React.
               | 
               | I do wonder if the gap will keep widening though. If
               | newer tools/tech don't have enough training data, LLMs
               | may struggle more with them early on. It's possible that
               | RAG and other optimization techniques will evolve fast
               | enough to narrow the gap and prevent diminishing returns
               | on LLM driven productivity.
        
               | Implicated wrote:
               | I'm also suspecting this has a lot to do with the
               | dichotomy between the "omg llms are amazing at code
               | tasks" and "wtf are these people using these llms for
               | it's trash" takes.
               | 
               | As someone who works primarily within the Laravel stack,
               | in PHP, the LLM's are wildly effective. That's not to say
               | there aren't warts - but my productivity has skyrocketed.
               | 
               | But it's become clear that when you venture into the
               | weeds of things that aren't very mainstream you're going
               | to get wildly more hallucinations and solutions that are
               | puzzling.
               | 
               | Another observation: I believe that when you start
               | getting outside of your expertise, you're likely going
               | to have a corresponding amount of 'waste' time where the
               | LLM spits out solutions that an expert in the domain
               | would immediately recognize as problematic, but that the
               | non-expert will look at and assume are reasonable - or,
               | worse, not even look at and just try to use.
               | 
               | 100% of the time that I've tried to get
               | Claude/Gemini/ChatGPT to "one shot" a whole feature or
               | refactor it's been a waste of time and tokens. But when
               | I've spent even a minimal amount of energy to focus it in
               | on the task, curate the context and then approach?
               | Tremendously effective most times. But this also requires
               | me to do enough mental work that I probably have an idea
               | of how it should work out which primes my capability to
               | parse the proposed solutions/code and pick up the pieces.
               | Another good flow is to just prompt the LLM (in this
               | case, Claude Code, or something with MCP/filesystem
               | access) with the feature/refactor/request asking it to
               | draw up the initial plan of implementation to feed to
               | itself. Then iterate on that as needed before starting up
               | a new session/context with that plan and hitting it one
               | item at a time, while keeping a running
               | {TASK_NAME}_WORKBOOK.md (that you task the llm to keep up
               | to date with the relevant details) and starting a new
               | session/context for each task/item on the plan, using the
               | workbook to get the new sessions up to speed.
               | 
               | Also, this is just a hunch, but I'm generally a nocturnal
               | creature and tend to be working in the evening into early
               | mornings. Once 8am PST rolls around I really feel like
               | Claude (in particular) just turns into mush. Responses
               | get slower but it seems it loses context where it
               | otherwise wouldn't start getting off topic/having to re-
               | read files it should already have in context. (Note; I'm
               | pretty diligent about refreshing/working with the context
               | and something happens in the 'work' hours to make it
               | terrible)
               | 
               | I'd imagine we're going to end up with language-specific
               | LLMs (though I have no idea, just seems logical to me)
               | that a 'main' model pushes tasks/tool usage to. We don't
               | need our "coding" LLMs to also be proficient on oceanic
               | tidal patterns and 1800s boxing history. Those are all
               | parameters that could have been better spent on the code.
        
           | thewebguyd wrote:
           | I've found, like you mentioned, that the tech stack you work
           | with matters a lot in terms of successful results from LLMs.
           | 
           | Python is generally fine, as you've experienced, as is
           | JavaScript/TypeScript & React.
           | 
           | I've had mixed results with C# and PowerShell. With
           | PowerShell, hallucinations are still a big problem. Not sure
           | if it's the Verb-Noun naming scheme of cmdlets, but most
           | models still make up cmdlets that don't exist on the fly
           | (though they will correct themselves once you point out that
           | a cmdlet doesn't exist - but at that point, why bother when I
           | can just do it myself correctly the first time).
           | 
           | With C#, even with my existing code as context, it can't
           | adhere to a consistent style, and can't handle nullable
           | reference types (albeit, a relatively new feature in C#). It
           | works, but I have to spend too much time correcting it.
           | 
           | Given my own experiences and the stacks I work with, I still
           | won't trust an LLM in agent mode. I make heavy use of them as
           | a better Google, especially since Google has gone to shit,
           | and to bounce ideas off of, but I'll still write the code
           | myself. I don't like reviewing code, and having LLMs write
           | code for me just turns me into a full time code reviewer, not
           | something I'm terribly interested in becoming.
           | 
           | I still get a lot of value out of the tools, but for me I'm
           | still hesitant to unleash them on my code directly. I'll
           | stick with the chat interface for now.
           | 
           |  _edit_ Golang is another language I've had problems relying
           | on LLMs for. On the flip side, LLMs have been great for me
           | with SQL and I'm grateful for that.
        
             | neonsunset wrote:
             | FWIW If you are using Github Copilot Edit/Agent mode - you
             | may have more luck with other plugins. Until recently,
             | Claude 3.5 Sonnet worked really well with C# and required
             | relatively few extra commands to stay consistent with the
             | "newest, tersest" style. But then, from my understanding,
             | there was a big change in how the Copilot extension handles
             | attached context, alongside changes to what I presume are
             | the prompt and fine-tuning, which resulted in severe
             | degradation of the output quality. Hell, even attaching
             | context data does not work properly 1 out of 3 times. But
             | at least Gemini 2.5 Pro can write tests semi-competently; I
             | still can't fathom how they managed to make it so much
             | worse!
        
         | bboygravity wrote:
         | This is hilarious to read if you have actually seen the average
         | (embedded systems) production code written by humans.
         | 
         | Either you have no idea how terrible real world commercial
         | software (architecture) is or you're vastly underestimating
         | newer LLMs or both.
        
         | onlyrealcuzzo wrote:
         | 2.5 pro seems like a huge improvement.
         | 
         | One area I've still noticed weakness is if you want to use a
         | pretty popular library from one language in another language,
         | it has a tendency to think the function signatures in the
         | popular language match the other.
         | 
         | Naively, this seems like a hard problem to solve.
         | 
         | I.e. ask it how to use torchlib in Ruby instead of Python.
        
         | viraptor wrote:
         | > no amount of prompting will get current models to approach
         | abstraction and architecture the way a person does.
         | 
         | What do you mean specifically? I found the "let's write a spec,
         | let's make a plan, implement this step by step with testing"
         | results in basically the same approach to design/architecture
         | that I would take.
        
         | nurettin wrote:
         | Just tell it to cite docs when using functions, works wonders.
        
         | tastysandwich wrote:
         | Re hallucinating APIs that don't exist - I find this with
         | Golang sometimes. I wonder if it's because the training data
         | doesn't just consist of all the docs and source code, but
         | potentially feature proposals that never made it into the
         | language.
         | 
         | Regexes are another area where I can't get much help from LLMs.
         | If it's something common like a phone number, that's fine. But
         | with anything novel it seems to have trouble. It will spit out
         | junk very confidently.
        
       | xnx wrote:
       | This is much bigger news than OpenAI's acquisition of WindSurf.
        
       | herpdyderp wrote:
       | I agree it's very good but the UI is still usually an unusable,
       | scroll-jacking disaster. I've found it's best to let a chat sit
       | for a few minutes after it has finished printing the AI's
       | output. Finding the `ms-code-block` element in dev tools and
       | logging `$0.textContent` is reliable too.
        
         | OsrsNeedsf2P wrote:
         | Loading the UI on mobile while on low bandwidth is also a non-
         | starter. It simply doesn't work.
        
         | uh_uh wrote:
         | Noticed this too. There's something funny about billion dollar
         | models being handicapped by stuck buttons.
        
           | energy123 wrote:
           | The Gemini app has a number of severe bugs that impact
           | everyone who uses it, and those bugs have persisted for over
           | 6 months.
           | 
           | There's something seriously dysfunctional and incompetent
           | about the team that built that web app. What a way to waste
           | the best LLM in the world.
        
             | kubb wrote:
             | It's the company. Letting incompetent people who are vocal
             | rise to the top is a part of Google's culture, and the
             | internal performance review process discourages excellence
             | - doing the thousand small improvements that make a
             | product truly great is invisible to it, so nobody does it.
             | 
             | Software that people truly love is impossible to build in
             | there.
        
       | crat3r wrote:
       | So, are people using these tools without the org they work for
       | knowing? The amount of hoops I would have to jump through to get
       | either of the smaller companies I have worked for since the AI
       | boom to let me use a tool like this would make it absolutely not
       | worth the effort.
       | 
       | I'm assuming large companies are mandating it, but ultimately the
       | work that these LLMs seem poised for would benefit smaller
       | companies most and I don't think they can really afford using
       | them? Are people here paying for a personal subscription and then
       | linking it to their work machines?
        
         | jeffbee wrote:
         | Not every coding task is something you want to check into your
         | repo. I have mostly used Gemini to generate random crud. For
         | example I had a huge JSON representation of a graph, and I
         | wanted the graph modified in a given way, and I wanted it
         | printed out on my terminal in color. None of which I was
         | remotely interested in writing, so I let a robot do it and it
         | was fine.
        
           | crat3r wrote:
           | Fair, but I am seeing so much talk about how it is completing
           | actual SDE tickets. Maybe not this model specifically, but to
           | be honest I don't care about generating dummy data, I care
           | about the claims that these newer models are on par with
           | junior engineers.
           | 
           | Junior engineers will complete a task to update an API, or
           | fix a bug on the front-end, within a couple days with let's
           | say 80 percent certainty they hit the mark (maybe an inflated
           | metric). How are people comparing the output of these models
           | to that of a junior engineer if they generally just say "Here
           | is some of my code, what's wrong with it?". That certainly
           | isn't taking a real ticket and completing it in any capacity.
           | 
           | I am obviously very skeptical but mostly I want to try one of
           | these models myself but in reality I think that my higher-ups
           | would think that they introduce both risk AND the potential
           | for major slacking off haha.
        
             | jpc0 wrote:
             | I don't know about tickets but my org definitely happily
             | pays for Gemini Advanced and encourages its use and would
             | be considered a small org.
             | 
             | The latest SOTA models are definitely at the point where
             | they can absolutely improve workflows and not get in your
             | way too much.
             | 
             | I treat it a lot like an intern, "Here's an api doc and
             | spec, write me the boilerplate and a general idea about
             | implementation"
             | 
             | Then I go in, review, rip out crud and add what I need.
             | 
             | It almost always gets architecture wrong, don't expect that
             | from it. However, small functions and such are great.
             | 
             | When it comes to refactoring ask it for suggestions, eat
             | the meat leave the bones.
        
         | bongodongobob wrote:
         | I work for a large company and everything other than MS Copilot
         | is blocked aggressively at the DNS/cert level. Tried Deepseek
         | when it came out and they already had it blocked. All .ai TLDs
         | are blocked as well. If you're not in tech, there is a lot of
         | "security" fear around AI.
        
         | codebolt wrote:
         | If you can get them to approve GitHub Copilot Business then
         | Gemini Pro 2.5 and many others are available there. They have
         | guarantees that they don't share/store prompts or code and the
         | parent company is Microsoft. If you can argue that they will
         | save money (on saved developer time), what would be their
         | argument against?
        
           | otabdeveloper4 wrote:
           | > They have guarantees that they don't share/store prompts or
           | code
           | 
           | "They trust me. Dumb ..."
        
         | tasuki wrote:
         | > The amount of hoops I would have to jump through to get
         | either of the smaller companies I have worked for since the AI
         | boom to let me use a tool like this would make it absolutely
         | not worth the effort.
         | 
         | Define "smaller"? In small companies, say 10 people, there are
         | no hoops. That is the whole point of small companies!
        
       | ionwake wrote:
       | Is it possible to use this with Cursor? If so, what is the name of
       | the model? gemini-2.5-pro-preview ?
       | 
       | edit> Its gemini-2.5-pro-preview-05-06
       | 
       | edit> Cursor says it doesn't have "good support" yet, but I'm
       | not sure if this is a default message when it doesn't recognise
       | a model? Is this a big deal? Should I wait until it's officially
       | supported by Cursor?
       | 
       | Just trying to save time here for everyone - anyone know the
       | answer?
        
         | androng wrote:
         | At the bottom of the article it says no action is required and
         | the Gemini-2.5-pro-preview-03-25 now points to the new model
        
           | ionwake wrote:
           | Well, a lot of action was required, such as adding the model,
           | so no idea what happened to the guy who wrote the article -
           | maybe there is a new Cursor update now.
        
         | tough wrote:
         | Cursor UI sucks, it tells me to use -auto mode- to be faster,
         | but gemini 2.5 is way faster than any of the other free models,
         | so just selecting that one is faster even if the UI says
         | otherwise
        
           | ionwake wrote:
           | yeah I've noticed this too, like wtf would I use Auto?
        
         | bn-l wrote:
         | The one with exp in the name is free (you may have to add it
         | yourself) but they train on you. And after a certain limit it
         | becomes paid.
        
       | xbmcuser wrote:
       | As a non-programmer, I have been really loving Gemini 2.5 Pro
       | for my Python scripting for manipulating text and Excel files
       | for web scraping. In the past I was able to use ChatGPT to code
       | some of the things that I wanted, but with Gemini 2.5 Pro it has
       | been just another level. If they improve it further that would
       | be amazing.
        
       | djrj477dhsnv wrote:
       | I don't understand what I'm doing wrong.. it seems like everyone
       | is saying Gemini is better, but I've compared dozens of examples
       | from my work, and Grok has always produced better results.
        
         | redox99 wrote:
         | I haven't tested this release yet, but I found Gemini to be
         | overrated before.
         | 
         | My choice of LLMs was
         | 
         | Coding in cursor: Claude
         | 
         | General questions: Grok, if it fails then Gemini
         | 
         | Deep Research: Gemini (I don't have GPT plus, I heard it's
         | better)
        
         | dyauspitr wrote:
         | Anecdotally grok has been the worst of the bunch for me.
        
         | athoun wrote:
         | I agree, from my experience Grok gives superior coding results,
         | especially when modifying large sections of the codebase at
         | once such as in refactoring.
         | 
         | Although it's not for coding, I have noticed Gemini 2.5 Pro
         | Deep Research has surpassed Grok's DeepSearch in thoroughness
         | and research quality.
        
       | white_beach wrote:
       | object?
       | 
       | (aider joke)
        
       | llm_nerd wrote:
       | Their nomenclature is a bit confusing. The Gemini web app has a
       | 2.5 Pro (experimental), yet this apparently is referring to 2.5
       | Pro Preview 05-06.
       | 
       | Would be ideal if they incremented the version number or the
       | like.
        
       | martinald wrote:
       | I'm totally lost again! If I use Gemini on the website
       | (gemini.google.com), am I using 2.5 Pro IO edition, or am I using
       | the old one?
        
         | disgruntledphd2 wrote:
         | Check the dropdown in the top left (on my screen, at least).
        
           | martinald wrote:
           | Are you referring to gemini.google.com or ai studio? I see
           | 2.5 Pro but is this the right one? I saw a tweet from them
           | saying you have to select Canvas first? I'm so so lost.
        
         | koakuma-chan wrote:
         | http://aistudio.google.com/app/prompts/new_chat?model=gemini...
        
           | martinald wrote:
           | I get this in AI studio, but does it apply to
           | gemini.google.com?
        
         | pzo wrote:
         | "The previous iteration (03-25) now points to the most recent
         | version (05-06), so no action is required to use the improved
         | model"
        
       | oellegaard wrote:
       | Is there anything like Claude code for other models such as
       | gemini?
        
         | mickeyp wrote:
         | I'm literally working on this particular problem. Locally-run
         | server; browser-based interface instead of TUI/CLI; connects to
         | all the major model APIs; many, many quality of life and
         | feature improvements over other tools that hook into your
         | browser.
         | 
         | Drop me a line (see profile) if you're interested in beta
         | testing it when it's out.
        
           | oellegaard wrote:
           | I'm actually very happy with everything in Claude Code, e.g.
           | the CLI, so I'm really just curious to try other models.
        
             | revicon wrote:
             | Same! I prefer the CLI, way easier when I'm connected via
             | ssh from another network somewhere.
        
               | mickeyp wrote:
               | The CLI definitely has its advantages!
               | 
               | But with my app: you can install the host anywhere and
               | connect to it securely (via SSH forwarding or private VPN
               | or what have you) so that workflow definitely still
               | works!
        
             | Filligree wrote:
             | I find that 2.5 Pro has a higher ceiling of understanding,
             | while Claude writes more maintainable code with better
             | comments. If we want to combine them... well, it should be
             | easier to fix 2.5 than Claude. That said, neither is there
             | yet.
             | 
             | Currently Claude Code is a big value-add for Claude. Google
             | has nothing equivalent; aider requires far more manual
             | work.
        
         | alphabettsy wrote:
         | Aider
        
           | danielbln wrote:
           | Aider wasn't all that agentic last time I tried it, has that
           | changed?
        
         | elliot07 wrote:
         | OpenAI has an equivalent called Codex. It's lacking a few
         | features like MCP right now and the TUI isn't there yet, but
         | interestingly they are building a Rust version (it's all open
         | source) that seems to include MCP support and looks
         | significantly higher quality. I'd bet within the next few
         | weeks there will be a high quality Claude Code alternative.
        
         | martythemaniak wrote:
         | Goose by Block (Square/CashApp) is like an open-source Claude
         | Code that works with any remote or local LLM.
         | 
         | https://github.com/block/goose
        
         | vunderba wrote:
         | Haven't tried it yet, but I've heard good things about Plandex.
         | 
         | https://github.com/plandex-ai/plandex
        
       | mliker wrote:
       | The "video to learning app" feature is a cool concept (see it in
       | AI Studio). I just passed in two separate Stanford lectures to
       | see if it could come up with an interesting interactive app. The
       | apps it generated weren't too useful, but I can see with more
       | focus and development, it'd be a game changer for education.
        
         | SparkyMcUnicorn wrote:
         | Anyone know of any coding agents that support video inputs?
         | 
         | Web chat interfaces are great, but copy/paste gets old fast.
        
         | lostmsu wrote:
         | I wonder how it processes video. Even individual pictures take
         | a lot of tokens.
        
       | brap wrote:
       | Gemini is now ranked #1 across every category in lmarena.
        
         | aoeusnth1 wrote:
         | LMArena is a joke, though
        
       | killerstorm wrote:
       | Why can't they just use version numbers instead of this "new
       | preview" stuff?
       | 
       | E.g. call it Gemini Pro 2.5.1.
        
         | lukeschlather wrote:
         | I take preview to mean the model may be retired on an
         | accelerated timescale and replaced with a "real" model so it's
         | dangerous to put into prod unless you are paying attention.
        
           | lolinder wrote:
           | They could still use version numbers for that. 2.5.1-preview
           | becomes 2.5.1 when stable.
        
           | danenania wrote:
           | Scheduled tasks in ChatGPT are useful for keeping track of
           | these kinds of things. You can have it check daily whether
           | there's a change in status, price, etc. for a particular
           | model (or set of models).
        
             | cdolan wrote:
             | I appreciate that you are trying to help.
             | 
             | But I do not want to have to build a network of bots with
             | non-deterministic outputs to simply stay on top of versions
        
               | danenania wrote:
               | Neither do I, but it's the best solution I've found so
               | far. It beats checking models/prices manually every day
               | to see if anything has changed, and it works well enough
               | in practice.
               | 
               | But yeah, some kind of deterministic way to get alerts
               | would be better.
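               | 
               | Something deterministic is doable, though: poll the
               | provider's model-listing endpoint and diff it against
               | what you saw last time. A minimal sketch, assuming the
               | Gemini API's public v1beta/models endpoint and a key in
               | GEMINI_API_KEY (other providers have equivalents):
               | 
               |     import json, os, urllib.request
               | 
               |     URL = ("https://generativelanguage.googleapis.com"
               |            "/v1beta/models?key="
               |            + os.environ["GEMINI_API_KEY"])
               |     CACHE = "models.json"
               | 
               |     # Model names the API reports right now.
               |     with urllib.request.urlopen(URL) as r:
               |         data = json.load(r)["models"]
               |     now = sorted(m["name"] for m in data)
               | 
               |     # The list we saw on the previous run, if any.
               |     old = []
               |     if os.path.exists(CACHE):
               |         old = json.load(open(CACHE))
               | 
               |     added = set(now) - set(old)
               |     removed = set(old) - set(now)
               |     if added or removed:
               |         print("model list changed:", added, removed)
               | 
               |     json.dump(now, open(CACHE, "w"))
               | 
               | It only catches models appearing or disappearing, not
               | price changes, but it covers the retirement case.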
        
         | mhh__ wrote:
         | Are you saying you find model names like o4-mini-high-pro-
         | experimental-version5 confusing and stupid?
        
       | andy12_ wrote:
       | Interestingly, when comparing benchmarks of Experimental 03-25
       | [1] and Experimental 05-06 [2] it seems the new version scores
       | slightly lower in everything except on LiveCodeBench.
       | 
       | [1] https://storage.googleapis.com/model-
       | cards/documents/gemini-... [2]
       | https://deepmind.google/technologies/gemini/
        
         | arnaudsm wrote:
         | This should be the top comment. Cherry-picking is hurting this
         | industry.
         | 
         | I bet they kept training on coding tasks, made everything worse
         | on the way, and tried to hide it under the rug because of the
         | sunk costs.
        
           | luckydata wrote:
           | Or because they realized that coding is what most of those
           | LLMs are used for anyways?
        
             | arnaudsm wrote:
             | They should have shown the benchmarks. Or market it as a
             | coding model, like Qwen & Mistral.
        
               | jjani wrote:
               | That's clearly not a PR angle they could possibly take
               | when it's replacing the overall SotA model. This is a
               | business decision, potentially inference cost related.
        
               | arnaudsm wrote:
               | From a business pov it's a great move, for the customers
               | it's evil to hide evidence that your product became
               | worse.
        
           | cma wrote:
           | They likely knew continued training on code would cause some
           | amount of catastrophic forgetting on other stuff. They didn't
           | throw away the old weights, so it's probably not sunk cost
           | fallacy going on. But since the model is relatively new, and
           | they found out X% of API token spend was on coding agents
           | (where X is huge) compared to the token spend distribution on
           | prior Geminis that couldn't code well, they probably didn't
           | want the complexity and worse batching of having another
           | model for it. They likely decided the impacts weren't too
           | large, that they didn't weight coding enough initially, and
           | that it is worth the tradeoffs.
        
         | jjani wrote:
         | Sounds like they were losing so much money on 2.5-Pro they came
         | up with a forced update that made it cheaper to run. They can't
         | come out with "we've made it worse across the board", nor do
         | they want to be the first to actually raise prices, so instead
         | they made a bit of a distill that's slightly better at coding
         | so they can still spin it positively.
        
           | sauwan wrote:
           | I'd be surprised if this was a new base model. It sounds like
           | they just did some post-training RL tuning to make this
           | version specifically stronger for coding, at the expense of
           | other priorities.
        
             | jjani wrote:
             | Every frontier model now is a distill of a larger
             | unpublished model. This could be a slightly smaller
             | distill, with potentially the extra tuning you're
             | mentioning.
        
               | cubefox wrote:
               | That's an unsubstantiated claim. I doubt this is true,
               | since people are disproportionately more willing to pay
               | for the best of the best, rather than for something
               | worse.
        
               | tangjurine wrote:
               | Any info on this?
        
           | Workaccount2 wrote:
           | Google doesn't pay the nvidia tax. Their TPUs are designed
           | for Gemini and Gemini designed for their TPUs. Google is no
           | doubt paying far less per token than every other AI house.
        
         | merksittich wrote:
         | According to the article, "[t]he previous iteration (03-25) now
         | points to the most recent version (05-06)." I assume this
         | applies to both the free tier gemini-2.5-pro-exp-03-25 in the
         | API (which will be used for training) and the paid tier
         | gemini-2.5-pro-preview-03-25.
         | 
         | Fair enough, one could say, as these were all labeled as
         | preview or experimental. Still, considering that the new model
         | is slightly worse across the board in benchmarks (except for
         | LiveCodeBench), it would have been nice to have the option to
         | stick with the older version. Not everyone is using these
         | models for coding.
        
           | zurfer wrote:
           | Just switching a pinned version (even alpha, beta,
           | experimental, preview) to another model doesn't feel right.
           | 
           | I get it, chips are scarce and they want their capacity back,
           | but it breaks trust with developers to just downgrade your
           | model.
           | 
           | Call it gemini-latest and I understand that things will
           | change. Call it *-03-25 and I want the same model that I got
           | on 25th March.
        
         | nopinsight wrote:
         | Livebench.ai actually suggests the new version is better on
         | most things.
         | 
         | https://livebench.ai/#/
        
       | thevillagechief wrote:
       | I've been switching between this and GPT-4o at work, and Gemini
       | is really verbose. But I've been primarily using it. I'm confused
       | though, the model available in copilot says Gemini 2.5 Pro
       | (Preview), and I've had it for a few weeks. This was just
       | released today. Is this an updated preview? If so, the
       | blog/naming is confusing.
        
       | CSMastermind wrote:
       | Hasn't Gemini 2.5 Pro been out for a while?
       | 
       | At first I was very impressed with its coding abilities,
       | switching off of Claude for it, but recently I've been using GPT
       | o3, which I find is much more concise and generally better at
       | problem solving when you hit an error.
        
         | spaceman_2020 wrote:
         | Think that was still the experimental model incorrectly labeled
         | by many platforms as "Pro"
        
           | 85392_school wrote:
           | That's inaccurate. First, there was the experimental 03-25
           | checkpoint. Then it was promoted to Preview without changing
           | anything. And now we have a new 05-06 checkpoint, still
           | called Gemini 2.5 Pro, and still in Preview.
        
       | laborcontract wrote:
       | My guess is that they've done a lot of tuning to improve diff
       | based code editing. Gemini 2.5 is fantastic at agentic work, but
       | it still is pretty rough around the edges in terms of generating
       | perfectly matching diffs to edit code. It's probably one of the
       | very few issues with the model. Luckily, aider tracks this.
       | 
       | They measure the old gemini 2.5 generating proper diffs 92% of
       | the time. I bet this goes up to ~95-98%
       | https://aider.chat/docs/leaderboards/
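       | 
       | For anyone who hasn't hit this: the failure mode is mundane. The
       | model emits a search/replace style edit, and the "search" text
       | has to match the file exactly, so one hallucinated space or a
       | stale line and the whole edit gets rejected. A toy sketch of that
       | application step (not aider's actual implementation, just the
       | general idea):
       | 
       |     from pathlib import Path
       | 
       |     def apply_edit(path, search, replace):
       |         """Apply one search/replace edit; reject it unless
       |         the search text appears exactly once in the file."""
       |         src = Path(path).read_text()
       |         if src.count(search) != 1:
       |             # One wrong space and the edit fails here.
       |             return False
       |         Path(path).write_text(src.replace(search, replace, 1))
       |         return True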
       | 
       | Question for the google peeps who monitor these threads: Is
       | gemini-2.5-pro-exp (free tier) updated as well, or will it go
       | away?
       | 
       | Also, in the blog post, it says:                 > The previous
       | iteration (03-25) now points to the most recent version (05-06),
       | so no action is required to use the improved model, and it
       | continues to be available at the same price.
       | 
       | Does this mean gemini-2.5-pro-preview-03-25 now uses 05-06? Does
       | the same apply to gemini-2.5-pro-exp-03-25?
       | 
       | update: I just tried updating the date in the exp model
       | (gemini-2.5-pro-exp-05-06) and that doesn't work.
        
         | okdood64 wrote:
         | What do you mean by agentic work in this context?
        
           | laborcontract wrote:
           | Knowing when to call functions, generating the proper
           | function calling text structure, properly executing functions
           | in sequence, knowing when it's completed its objective, and
           | doing that over an extended context window.
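           | 
           | In loop form it's roughly this shape (a hand-wavy sketch, not
           | tied to any particular SDK):
           | 
           |     def run_agent(llm, tools, objective):
           |         """Drive the model until it reports it's done."""
           |         history = [objective]
           |         while True:
           |             # The model either picks a tool or finishes.
           |             step = llm(history, tools)
           |             if step.done:
           |                 return step.answer
           |             # Run the chosen tool and feed the result back.
           |             result = tools[step.tool](**step.args)
           |             history.append(result)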
        
         | laborcontract wrote:
         | Update 2: I've been using this model in both aider and cline
         | and I haven't gotten a diff matching error yet, even with
         | some pretty difficult substitutions across different places in
         | multiple files. The overall feel of this model is nice.
         | 
         | I don't have a formal benchmark but there's a notable
         | improvement in code generation due to this alone.
         | 
         | I've had gemini chug away on plans that have taken ~1 hour to
         | implement. (~80mln tokens spent) A good portion of that energy
         | was spent fixing mistakes made by cline/aider/roo due to
         | search/replace mistakes. If this model gets anywhere close to
         | 100% on diffs then this is a BFD. I estimate this will
         | translate to a 50-75% productivity boost on long context coding
         | tasks. I hope the initial results i'm seeing hold up!
         | 
         | I'm surprised by the reaction in the rest of the thread. A lot
         | of unproductive complaining, a lot of off topic stuff, nothing
         | talking about the model itself.
         | 
         | Any thoughts from anyone else using the updated model?
        
       | EliasWatson wrote:
       | I wonder how the latest version of Grok 3 would stack up to
       | Gemini 2.5 Pro on the web dev arena leaderboard. They are still
       | just showing the original early access model for some reason,
       | despite there being API access to the latest model. I've been
       | using Grok 3 with Aider Chat and have been very impressed with
       | it. I get $150 of free API credits every month by allowing them
       | to train on my data, which I'm fine with since I'm just working
       | on personal side projects. Gemini 2.5 Pro and Claude 3.7 might be
       | a little better than Grok 3, but I can't justify the cost when
       | Grok doesn't cost me a penny to use.
        
       | mohsen1 wrote:
       | I use Gemini for almost everything. But their model card[1] only
       | compares to o3-mini! In known benchmarks o3 is still ahead:
       | +------------------------------+---------+--------------+
       | |         Benchmark            |   o3    | Gemini 2.5   |
       | |                              |         |    Pro       |
       | +------------------------------+---------+--------------+
       | | ARC-AGI (High Compute)       |  87.5%  |     --        |
       | | GPQA Diamond (Science)       |  87.7%  |   84.0%      |
       | | AIME 2024 (Math)             |  96.7%  |   92.0%      |
       | | SWE-bench Verified (Coding)  |  71.7%  |   63.8%      |
       | | Codeforces Elo Rating        |  2727   |     --        |
       | | MMMU (Visual Reasoning)      |  82.9%  |   81.7%      |
       | | MathVista (Visual Math)      |  86.8%  |     --        |
       | | Humanity's Last Exam         |  26.6%  |   18.8%      |
       | +------------------------------+---------+--------------+
       | 
       | [1] https://storage.googleapis.com/model-
       | cards/documents/gemini-...
        
         | jsnell wrote:
         | The text in the model card says the results are from March
         | (including the Gemini 2.5 Pro results), and o3 wasn't released
         | yet.
         | 
         | Is this maybe not the updated card, even though the blog post
         | claims there is one? Sure, the timestamp is in late April, but
         | I seem to remember that the first model card for 2.5 Pro was
         | only released in the last couple of weeks.
        
         | cbg0 wrote:
         | o3 is $40/M output tokens and 2.5 Pro is $10-15/M output
         | tokens, so o3 being slightly ahead is not really worth paying
         | 4 times more than Gemini.
        
           | jorl17 wrote:
           | Also, o3 is insanely slow compared to Gemini 2.5 Pro
        
           | i_have_an_idea wrote:
           | Not sure why this is being downvoted, but it's absolutely
           | true.
           | 
           | If you're using these models to generate code daily, the
           | costs add up.
           | 
           | Sure, I'll give a really tough problem to o3 (and probably
           | over ChatGPT, not the API), but on general code tasks, there
           | really isn't a meaningful enough difference to justify 4x the
           | cost.
        
       | gitroom wrote:
       | man that endless commenting seriously kills my flow - gotta say,
       | even after all the prompts and hacks, still can't get these
       | models to chill out. you think we'll ever get ai to stop
       | overdoing it and actually fit real developer habits or is it
       | always gonna be like this?
        
       | arnaudsm wrote:
       | Be careful, this model is worse than 03-25 in 10 of the 12
       | benchmarks (!)
       | 
       | I bet they kept training on coding, made everything worse on the
       | way, and tried to hide it under the rug because of the sunk
       | costs.
        
         | jstummbillig wrote:
         | It seems that trying to build llms is the definition of
         | accepting sunk cost.
        
       | nashashmi wrote:
       | I keep hearing good things about Gemini online and offline. I
       | wrote them off as terrible when they first launched and have not
       | looked back since.
       | 
       | How are they now? Sufficiently good? Competent? Competitive? Or
       | limited? My needs are very consumer oriented, not programming/api
       | stuff.
        
         | hmate9 wrote:
         | Probably the best one right now, their deep research is also
         | very good.
        
         | danielbln wrote:
         | Bard sucked, Gemini sucked, Gemini 2 was alright, 2.5 is
         | awesome and my main driver for coding these days.
        
         | thevillagechief wrote:
         | The Gemini deep research is a revelation. I obsessively
         | research most things I buy, from home appliances to gym
         | equipment. It has literally saved untold hours of comparisons.
         | You get detailed reports generated from every website including
         | youtube reviews. I've bought a bunch of stuff on its
         | recommendation.
        
           | Imanari wrote:
           | care to share your search prompt?
        
       | ramoz wrote:
       | Never sleep on Google.
        
       | panarchy wrote:
       | Is it just me who finds that while Gemini 2.5 is able to generate
       | a lot of code, the end results are usually lackluster compared to
       | Claude and even ChatGPT? I also find it hard-headed: it
       | frequently does things in ways I explicitly told it not to. The
       | massive context window is pretty great though, and enables me to
       | do things I can't with the others, so it still gets used a lot.
        
         | scrlk wrote:
         | How are you using it?
         | 
         | I find that I get the best results from 2.5 Pro via Google AI
         | Studio with a low temperature (0.2-0.3).
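         | 
         | If you're calling it through the API rather than AI Studio, the
         | same knob looks roughly like this with the python-genai SDK (a
         | rough sketch; key handling and model name are whatever you
         | normally use):
         | 
         |     from google import genai
         |     from google.genai import types
         | 
         |     # Picks up the API key from the environment
         |     # (or pass api_key=... explicitly).
         |     client = genai.Client()
         | 
         |     resp = client.models.generate_content(
         |         model="gemini-2.5-pro-preview-05-06",
         |         contents="Refactor this function ...",
         |         config=types.GenerateContentConfig(temperature=0.2),
         |     )
         |     print(resp.text)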
        
           | panarchy wrote:
           | AI Studio as well, but I haven't played around with the
           | temperature too much and even then I only lowered it to like
           | 0.8 a few times. So I'll have to try this out. Thanks.
        
       | xyst wrote:
       | Proprietary junk beats DeepSeek by a mere 213 points?
       | 
       | Oof. G and others are way behind
        
       | childintime wrote:
       | How does it perform on anything but Python and Javascript? In my
       | experience my mileage varied a lot when using C#, for example,
       | Zig, so I've learnt to just let it select the language it wants.
       | 
       | Also, why doesn't Ctrl+C work??
        
         | scbenet wrote:
         | It's very good at Go, which makes sense because I'm assuming
         | it's trained on a lot of Google's code
        
           | simianwords wrote:
           | How would they train it on google code without revealing
           | internal IP?
        
       | obsolete_wagie wrote:
       | o3 is so far ahead of Anthropic and Google that these models
       | aren't even worth using.
        
         | Workaccount2 wrote:
         | o3 is expensive in the API and intentionally crippled in the
         | web app.
        
         | Squarex wrote:
         | source?
        
           | obsolete_wagie wrote:
           | use the models daily, it's not even close
        
         | mattlondon wrote:
         | The benchmarks (1) seem to suggest that o3 is in 3rd place
         | after Gemini 2.5 pro preview and Gemini 2.5 pro exp (for text
         | reasoning, o3 4th for webdev). o3 doesn't even appear on the
         | openrouter leaderboards (2), suggesting it is hardly used (if
         | at all) by anyone using LLMs to actually _do_ anything (such as
         | coding), which makes one question if it is actually any good at
         | all (otherwise, if it was so great, I'd expect to see heavy
         | usage).
         | 
         | Not sure where your data is coming from but everything else is
         | pointing to Google supremacy in AI right now. I look forward to
         | some new models from Anthropic, xAi, Meta et al (remains to be
         | seen if OpenAI has anything left apart from bluster). Exciting
         | times.
         | 
         | 1 - https://beta.lmarena.ai/leaderboard
         | 
         | 2 - https://openrouter.ai/rankings
        
           | obsolete_wagie wrote:
           | You just aren't using the models to their full capacity if
           | you think this; benchmarks have all been hacked.
        
         | cellis wrote:
         | 8x the cost for maybe 5% improvement?
        
         | epolanski wrote:
         | Not my experience, at all.
         | 
         | I have long stopped using OpenAI products, and all oX have been
         | letdowns.
         | 
         | For coding it has been Claude 3.5 -> 3.7 -> Gemini 2.5 for me.
         | For general use it has been chatgpt -> Gemini.
         | 
         | Google has retaken the ML crown for my use cases and it keeps
         | getting better.
         | 
         | Gemini 2.0 flash was also the first LLM I put in production,
         | because for my use case (summarizing news articles and
         | translate them) it was way too fast, accurate and cheap to
         | ignore whereas ChatGPT was consistently too slow and expensive
         | to be even considered.
        
       | ionwake wrote:
       | Can someone tell me if Windsurf is better than Cursor?
       | (Preferably someone who has used both for a few days.)
        
         | kurtis_reed wrote:
         | Relevance?
        
           | ionwake wrote:
           | It's what literally every HN coder is using to program with
           | these models, much as Gemini. Where you been, brother?
        
         | ramoz wrote:
         | Claude Code and its not close. I feed my entire project to
         | gemini for planning and figuring out complex solutions for
         | claude code to execute on. I use Prompt Tower for building
         | entire codebase prompts for gemini.
        
           | ionwake wrote:
           | Fantastic reply, thanks. Can I ask if you have tried Cursor?
           | I used to use Claude Code but it was super expensive and got
           | stuck in loops (I know it is cheaper now). Do you have any
           | thoughts?
        
             | ramoz wrote:
             | I spend the money on Claude Code, and don't think twice.
             | I've spent low 1,000s at this point but the return is
             | justified.
             | 
             | I use Cursor when I code myself. But I don't use its chat
             | or agent features. I had replaced VS Code with it but at
             | this point I could go back to VS Code, but I'm lazy.
             | 
             | Cursor agent/chat are fine if you're bottlenecked by
             | money. I have no idea why or how it uses things like the
             | codebase embedding. An agent on top of a filesystem is a
             | powerful thing. People also like Aider and RooCode for the
             | CLI experience and I think they are affordable.
             | 
             | To make the most use of these things, you need to guide
             | them and provide them adequate context for every task. For
             | Claude Code I have built a meta management framework that
             | works really well. If I were forced to use cursor I would
             | use the same approach.
        
       | m_kos wrote:
       | [Tangent] Anyone here using 2.5 Pro in Gemini Advanced? I have
       | been experiencing a ton of bugs, e.g.,:
       | 
       | - [codes] showing up instead of references,
       | 
       | - raw search tool output sliding across the screen,
       | 
       | - Gemini continuously answering questions asked two or more
       | messages before but ignoring the most recent one (you need to ask
       | Gemini an unrelated question for it to snap out of this bug for a
       | few minutes),
       | 
       | - weird messages including text irrelevant to any of my chats
       | with Gemini, like baseball,
       | 
       | - confusing its own replies with mine,
       | 
       | - not being able to run its own Python code due to some
       | unsolvable formatting issue,
       | 
       | - timeouts, and more.
        
         | Dardalus wrote:
         | The Gemini app is absolute dog doo... use it through AI studio.
         | Google ought to shut down the entire Gemini app.
        
       | paulirish wrote:
       | > Gemini 2.5 Pro now ranks #1 on the WebDev Arena leaderboard
       | 
       | It'd make sense to rename WebDev Arena to React/Tailwind Arena.
       | Its system prompt requires [1] those technologies and the entire
       | tool breaks when requesting vanilla JS or other frameworks. The
       | second-order implications of models competing on this narrow
       | definition of webdev are rather troublesome.
       | 
       | [1] https://blog.lmarena.ai/blog/2025/webdev-
       | arena/#:~:text=PROM...
        
         | martinsnow wrote:
         | Bwoah, it's almost as if React and Tailwind are the bees knees
         | in frontend atm
        
           | byearthithatius wrote:
           | Sadly. Tailwind is so oof in my opinion. Let's import
           | megabytes just so we don't have to write 5 whole CSS classes.
           | I mean, just copy-paste the code.
           | 
           | Don't get me started on how ugly the HTML becomes when most
           | tags have 20 f*cking classes which could have been two.
        
             | johnfn wrote:
             | In most reasonably-sized websites, Tailwind will decrease
             | overall bundle size when compared to other ways of writing
             | CSS. Which is less code, 100 instances of "margin-left:
             | 8px" or 100 instances of "ml-2" (and a single definition
             | for ml-2)? Tailwind will dead-code eliminate all rules
             | you're not using.
             | 
             | In typical production environments tailwind is only around
             | 10kb[1].
             | 
             | [1]: https://v3.tailwindcss.com/docs/optimizing-for-
             | production
        
         | postalrat wrote:
         | I've found them to be pretty good with vanilla html and css.
        
         | shortcord wrote:
         | Not a fan of the dominance of shadcn and Tailwind when it comes
         | to generating greenfield code.
        
           | BoorishBears wrote:
           | shadcn/ui is such a terrible thing for the frontend
           | ecosystem, and it'll get _even worse_ for it as AI gets
           | better.
           | 
           | Instead of learnable, stable, APIs for common components with
           | well established versioning and well defined tokens, we've
           | got people literally copying and pasting components and
           | applying diffs so they can claim they "own them".
           | 
           | Except the vast majority of them don't ever change a line and
           | just end up with a strictly worse version of a normal package
           | (typically out of date or a hodgepodge of "versions" because
           | they don't want to figure out diffs), and the few that do
           | make changes don't have anywhere _near_ the design sense to
           | be using shadcn since there aren't enough tokens to keep the
           | look and feel consistent across components.
           | 
           | The would-be 1% who would change it _and_ have their own
           | well-thought-out design systems don't get a lift from shadcn
           | either vs just starting with Radix directly.
           | 
           | -
           | 
           | Amazing spin job though with the "registry" idea too: "it's
           | actually very good for AI that we invented a parallel
           | distribution system for ad-hoc components with no standard
           | except a loose convention around sticking stuff in a folder
           | called ui"
        
         | aero142 wrote:
         | If LLMs are able to write better code with more declarative and
         | local programming components and Tailwind, then I could imagine
         | a future where a new programming language is created to
         | maximize LLM success.
        
           | epolanski wrote:
           | This so much.
           | 
           | To me it seems so strange that a few good language designers
           | and ML folks haven't grouped together to work on this.
           | 
           | It's clear that there is a space for some LLM meta language
           | that could be designed to compile to bytecode, binary, JS,
           | etc.
           | 
           | It also doesn't need to be textual like the code we write; it
           | could be some form of AST that an LLM can manipulate with
           | ease.
        
             | LZ_Khan wrote:
             | readability would probably be the sticking point
        
             | senbrow wrote:
             | At that point why not just have LLMs generate bytecode in
             | one shot?
             | 
             | Plenty of training data to go on, I'd imagine.
        
             | seb1204 wrote:
             | Would this be addressed by better documentation of code and
             | APIs as well as examples? All this would go into the
             | training materials and then be the body of knowledge.
        
         | nicce wrote:
         | > It'd make sense to rename WebDev Arena to React/Tailwind
         | Arena.
         | 
         | Funnily, the training of these models feels like it got cut off
         | in the middle of the v3/v4 Tailwind transition, and Gemini
         | always tries to correct my mistakes (... use v3 instead of v4)
        
       | qwertox wrote:
       | I have my issues with the code Gemini Pro in AI Studio generates
       | without customized "System Instructions".
       | 
       | It turns a well-readable code snippet of 5 lines into a 30-line
       | snippet full of comments and mostly unnecessary error handling -
       | code which becomes harder to reason about.
       | 
       | But for sysadmin tasks, like dealing with ZFS and LVM, it is
       | absolutely incredible.
        
         | bn-l wrote:
         | I've found the same thing. I don't use it for code any more
         | because it produces highly verbose and inefficient code that
         | may work but is ugly and subtly brittle.
        
       | mvdtnz wrote:
       | I truly do not understand how people are getting worthwhile
       | results from Gemini 2.5 Pro. I have used all of the major models
       | for lots of different programming tasks and I have never once had
       | Gemini produce something useful. It's not just wrong, it's
       | laughably bad. And people are making claims that it's the best. I
       | just... don't... get it.
        
         | WaltPurvis wrote:
         | That's weird. What languages/frameworks/tasks are you using it
         | for? I've been using Gemini 2.5 with Dart recently and it
         | frequently produces indisputably useful code, and indisputably
         | helpful advice. Along with some code that's pretty dumb or
         | misguided, and some advice that would be counterproductive if I
         | actually followed it. But "never once had Gemini produce
         | something useful" is _wildly_ different from my recent
         | experience.
        
       | franze wrote:
       | I like it. I threw some random concepts (Neon, LSD, Falling,
       | Elite, Shooter, Escher + Mobile Game + SPA) at it and this is
       | what it came up with after a few (5x) roundtrips.
       | 
       | https://show.franzai.com/a/star-zero-huge?nobuttons
        
       | cadamsdotcom wrote:
       | Google/Alphabet is a giant hulking machine that's been frankly
       | running at idle. All that resume-driven development, performance
       | review promo cycles, and retention of top talent mainly to work
       | on ad tech means it's packed to the rafters with
       | latent capability. Holding on to so much talent in the face of
       | basically having nothing to do is a testament to the company's
       | leadership - even if said leadership didn't manage to make Google
       | push humanity forward over the last decade or so.
       | 
       | Now that there's a big nugget to chew on (LLMs), you're seeing
       | that latent capability come to life. This awakening feels more
       | bottom-up driven than top-down. Google's a war machine chugging
       | along nicely in peacetime, but now it's war again!
       | 
       | Hats off to the engineers working on the tech. Excited to try it
       | out!
        
         | kccqzy wrote:
         | > retention of top talent mainly to work on ad tech
         | 
         | No the top talent worked on exciting things like Fuchsia. Ad
         | tech is boring stuff written by people who aren't enough of a
         | snob to refuse working on ad tech.
        
           | cadamsdotcom wrote:
           | Top talent worked on what now?
           | 
           | Isn't that a flower?
           | 
           | (Hopefully you see my point)
        
       | alana314 wrote:
       | The google sheets UI asked me to try Gemini to create a formula,
       | so I tried it, starting with "Create a formula...", and its
       | answer was "Sorry, I can't help with creating formulas yet, but
       | I'm still learning."
        
       | wewewedxfgdf wrote:
       | Gemini does not accept upload of TSX files, it says "File type
       | unsupported"
       | 
       | You must _rename your files to .tsx.txt THEN IT ACCEPTS THEM_
       | and it works perfectly fine writing TSX code.
       | 
       | This is absolutely bananas. How can such a powerful coding engine
       | have this behavior?
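       | 
       | In the meantime, a throwaway copy pass before uploading works; a
       | rough sketch, assuming your components live under src/:
       | 
       |     from pathlib import Path
       |     import shutil
       | 
       |     # Copy every .tsx file to a .tsx.txt twin that the upload
       |     # dialog will accept; the originals stay untouched.
       |     for f in Path("src").rglob("*.tsx"):
       |         shutil.copyfile(f, f.with_name(f.name + ".txt"))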
        
         | krat0sprakhar wrote:
         | Where are you testing this? I'm able to upload tsx files on
         | aistudio
        
           | wewewedxfgdf wrote:
           | https://gemini.google.com/app
        
       | jmward01 wrote:
       | Google's models are pretty good, but their API(s) and guarantees
       | aren't. We were just told today that 'quota doesn't guarantee
       | capacity' so basically on-demand isn't prod capable. Add to that
       | that there isn't a second vendor source like Anthropic and OpenAI
       | have, and Google's reliability makes it a hard sell to use them
       | unless you can back up the calls with a different model family
       | altogether.
        
       | simonw wrote:
       | Here's a summary of the 394 comments on this post created using
       | the new gemini-2.5-pro-preview-05-06. It looks very good to me -
       | well grouped, nicely formatted.
       | 
       | https://gist.github.com/simonw/7ef3d77c8aeeaf1bfe9cc6fd68760...
       | 
       | 30,408 input, 8,535 output = 12.336 cents.
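       | 
       | (That's just tokens times the per-million rates; assuming the
       | preview pricing of $1.25/M input and $10/M output, the arithmetic
       | is:)
       | 
       |     input_tokens, output_tokens = 30_408, 8_535
       |     in_rate, out_rate = 1.25, 10.00  # assumed $ per million
       | 
       |     cost = (input_tokens * in_rate
       |             + output_tokens * out_rate) / 1e6
       |     print(f"{cost * 100:.3f} cents")  # -> 12.336 cents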
       | 
       | 8,500 is a very long output! Finally a model that obeys my
       | instructions to "go long" when summarizing Hacker News threads.
       | Here's the script I used:
       | https://gist.github.com/simonw/7ef3d77c8aeeaf1bfe9cc6fd68760...
        
       ___________________________________________________________________
       (page generated 2025-05-06 23:00 UTC)