[HN Gopher] Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison
       ___________________________________________________________________
        
       Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison
        
       Author : mraniki
       Score  : 399 points
       Date   : 2025-03-31 12:09 UTC (10 hours ago)
        
 (HTM) web link (composio.dev)
 (TXT) w3m dump (composio.dev)
        
       | mraniki wrote:
       | TL;DR
       | 
        | If you want to jump straight to the conclusion, I'd say go for
        | Gemini 2.5 Pro: it's better at coding, has a one-million-token
        | context window compared to Claude's 200k, and you can get it
        | for free (a big plus). Claude 3.7 Sonnet is not that far
        | behind, but at this point there's little reason to use it over
        | Gemini 2.5 Pro.
        
         | kingkongjaffa wrote:
         | How are you getting gemini 2.5 pro for free?
         | 
         | In the gemini iOS app the only available models are currently
         | 2.0 flash and 2.0 flash thinking.
        
           | lyjackal wrote:
           | https://aistudio.google.com
        
           | diggan wrote:
           | > How are you getting gemini 2.5 pro for free?
           | 
           | I think the "AI Premium" plan of Google One includes access
           | to all the models, including the latest ones (at least that's
           | what it says for me in Spain): https://one.google.com/plans
        
           | HarHarVeryFunny wrote:
           | They just added it to the free tier today.
        
             | simonjulianl wrote:
              | Yup, you can navigate to https://gemini.google.com >
              | choose 2.5 Pro (experimental).
        
         | dsincl12 wrote:
          | Not sure what happened with Claude 3.7, but 3.5 is way better
          | at all day-to-day things. 3.7 felt like a major step back,
          | especially when it comes to coding, even though this was
          | highlighted as one aspect they improved upon. A 500k window
          | will soon be released for Claude. Not sure how much it will
          | improve anything though.
        
           | quesomaster9000 wrote:
           | With Claude 3.7 I keep having to remind it about things, and
           | go back and correct it several times in a row, before
           | cleaning the code up significantly.
           | 
            | For example, yesterday I wanted to make a 'simple' time
            | format, tracking Earth's orbits of the Sun, the Moon's
            | orbits of Earth, and rotations of Earth from a specific
            | given point in time (the most recent 2020 great
            | conjunction) - without directly using any hard-coded
            | constants other than the orbital mechanics and my atomic
            | clock source. This would be in the format `S4.7.... L52...
            | R1293...` for sols, luns & rotations.
           | 
            | I keep having to remind it to go back to first principles:
            | we want actual rotations, real day lengths, etc., rather
            | than hard-coded constants that approximate the mean over
            | the year.
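To make the format described above concrete, here is the naive sketch built on mean orbital constants — precisely the approximation the commenter wants the model to avoid (a first-principles version would derive the periods from ephemeris data instead). The epoch and constant values are my assumptions, not figures from the thread.

```python
# Naive S/L/R timestamp: counts of mean years ("sols"), lunations
# ("luns"), and sidereal rotations since the 2020 great conjunction.
# All constants are textbook mean values, used here only to pin down
# the output format.
from datetime import datetime, timezone

EPOCH = datetime(2020, 12, 21, 18, 20, tzinfo=timezone.utc)  # approx. conjunction time

MEAN_YEAR_S = 365.25636 * 86400      # sidereal year, in seconds
MEAN_LUNATION_S = 29.530589 * 86400  # synodic month, in seconds
MEAN_ROTATION_S = 86164.0905         # sidereal day, in seconds

def srl_stamp(now: datetime) -> str:
    dt = (now - EPOCH).total_seconds()
    sols = dt / MEAN_YEAR_S
    luns = dt / MEAN_LUNATION_S
    rots = dt / MEAN_ROTATION_S
    return f"S{sols:.4f} L{luns:.2f} R{rots:.1f}"

print(srl_stamp(datetime(2025, 3, 31, 12, 0, tzinfo=timezone.utc)))
```

At the epoch itself the stamp reads `S0.0000 L0.00 R0.0`; everything after that is just elapsed atomic time divided by the three mean periods.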
        
         | polycaster wrote:
          | If there were just an alternative to Claude Code...
        
           | Jowsey wrote:
           | Isn't https://aider.chat similar?
        
         | diggan wrote:
         | > has one million in context window
         | 
         | Is this _effective_ context window or just the absolute limit?
         | A lot of the models that claim to support very large context
         | windows cannot actually successfully do the typical  "needle in
         | a haystack" test, but I'm guessing there are published results
         | somewhere demonstrating Gemini 2.5 Pro can actually find the
         | needle?
        
           | oidar wrote:
           | This is a good question. There's a big difference in being
           | able to write coherent code and "needle in the haystack"
           | questions. I've found that Claude is able to do the needle in
           | the haystack questions just fine with a large context, but
           | not so with coding. You have to work to keep the context low
           | (around 15% to 20% in projects) to get coherent code that
           | doesn't confabulate.
        
           | llm_nerd wrote:
           | Google has had almost perfect recall in the needle in the
           | haystack test since 1.5[1], achieving close to 100% over the
           | entire context window. I can't provide a link benchmarking
           | 2.5 Pro in particular, but this has been a solved problem
           | with Google models so I assume the same is true with their
           | new model.
           | 
           | [1] https://cloud.google.com/blog/products/ai-machine-
           | learning/t...
        
             | diggan wrote:
              | Have those results been reproduced elsewhere, with
              | benchmarks other than what Google uses?
              | 
              | It's hard to trust their own benchmarks at this point,
              | and I'm not home at the moment so I can't try it myself
              | either.
        
               | llm_nerd wrote:
               | They are testing for a very straightforward needle
               | retrieval, as LLMs traditionally were terrible for this
               | in longer contexts.
               | 
               | There are some more advanced tests where it's far less
               | impressive. Just a couple of days ago Adobe released one
               | such test- https://github.com/adobe-research/NoLiMa
        
         | MITSardine wrote:
          | What does this context window mean? Is it the size of the
          | prompt the model can be made aware of?
         | 
         | In practice, can you use any of these models with existing code
         | bases of, say, 50k LoC?
        
       | bratao wrote:
        | For my use case, Gemini 2.5 is terrible. I have complex Cython
        | code in a single file (1500 lines) for sequence labeling.
        | Claude and o3 are very good at improving this code and
        | following the commands. Gemini always tries to make unrelated
        | changes. For example, I asked, separately, for small changes
        | such as removing an unused function, or caching the array
        | indexes. Every time, it completely refactored the code and was
        | obsessed with removing the GIL. The output code is always
        | broken, because removing the GIL is not easy.
        
         | fl_rn_st wrote:
          | This reflects my experience 1:1... even telling 2.5 Pro to
          | focus on the tasks given and ignore everything else leads to
          | it changing unrelated code. It's a frustrating experience,
          | because I believe at its core it is more capable than Sonnet
          | 3.5/3.7.
        
         | ldjkfkdsjnv wrote:
         | Yup, gemini 2.5 is bad.
        
           | itchyjunk wrote:
           | Were you also trying to edit the same code base as the GP or
           | did you evaluate it on some other criteria where it also
           | failed?
        
             | ldjkfkdsjnv wrote:
              | I take the same prompt and give it to 3.7, o1 pro, and
              | Gemini. I do this for almost everything, and these are
              | large 50k+ token prompts. Gemini is almost always behind.
        
         | ekidd wrote:
         | How are you asking Gemini 2.5 to change existing code? With
         | Claude 3.7, it's possible to use Claude Code, which gets
         | "extremely fast but untrustworthy intern"-level results. Do you
          | have a preferred setup to use Gemini 2.5 in a similar agentic
          | mode, perhaps using a tool like Cursor or aider?
        
           | bratao wrote:
           | For all LLMs, I'm using a simple prompt with the complete
           | code in triple quotes and the command at the end, asking to
           | output the complete code of changed functions. Then I use
           | Winmerge to compare the changes and apply. I feel more
           | confident doing this than using Cursor.
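The workflow described above — whole file in triple quotes, command at the end, reply restricted to changed functions — can be sketched in a few lines. The sample code and instruction are placeholders:

```python
# Sketch of the prompt layout described above: full source inside
# triple quotes, the command at the end, and a request to output only
# changed functions (which keeps the reply small enough to diff in
# WinMerge afterwards).
def build_prompt(code: str, instruction: str) -> str:
    return (
        f'"""\n{code}\n"""\n\n'
        f"{instruction}\n"
        "Output the complete code of every function you changed, "
        "and nothing else."
    )

prompt = build_prompt("def unused():\n    pass\n", "Remove the unused function.")
print(prompt)
```

Keeping the reply to whole changed functions (rather than fragments) is what makes the manual merge step tractable.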
        
             | pests wrote:
             | Should really check out aider. Automates this but also does
             | things like make a repo map of all your functions /
             | signatures for non-included files so it can get more
             | context.
        
         | redog wrote:
          | For me, I had to upload the library's current documentation,
          | because it was using outdated references, breaking code that
          | was already working instead of focusing on the parts I was
          | trying to build upon.
        
           | amarcheschi wrote:
            | Using outdated references and docs is something I've
            | experienced with more or less every model I've tried, from
            | time to time.
        
             | rockwotj wrote:
              | I am hoping MCP will fix this. I am building an MCP
              | integration with kapa.ai for my company to help devs
              | here. I guess this doesn't work if you don't add the
              | tool, though.
        
             | simonw wrote:
             | That's expected, because they almost all have training cut-
             | off dates from a year ago or longer.
             | 
             | The more interesting question is if feeding in carefully
             | selected examples or documentation covering the new library
             | versions helps them get it right. I find that to usually be
             | the case.
        
           | Jcampuzano2 wrote:
           | If you don't mind me asking how do you go about this?
           | 
            | I hear people commonly mention doing this, but I can't
            | imagine people are manually adding every page of the docs
            | for the libraries or frameworks they're using, since
            | unfortunately most are not in one single tidy page that's
            | easy to copy-paste.
        
             | dr_kiszonka wrote:
             | If you have access to the documentation source, you can
             | concatenate all files into one. Some software also has docs
             | downloadable as PDF.
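A minimal sketch of that concatenation step, assuming a local checkout of the docs source; the directory path and `*.md` pattern are assumptions:

```python
# Join every matching file under a docs tree into one blob, with a
# path header per file so the model can still tell the pages apart.
from pathlib import Path

def concat_docs(root: str, pattern: str = "*.md") -> str:
    parts = []
    for f in sorted(Path(root).rglob(pattern)):
        parts.append(f"\n\n--- {f.relative_to(root)} ---\n\n{f.read_text()}")
    return "".join(parts)
```

The resulting single string (or file) can then be pasted into the chat or attached as one document.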
        
             | genewitch wrote:
             | Have the AI write a quick script using bs4 or whatever to
             | take the HTML dump and output json, then all the aider-
             | likes can use that json as documentation. Or just the HTML,
             | but that wastes context window.
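In the same spirit, here is a stdlib-only sketch of that HTML-to-JSON idea (`html.parser` instead of bs4, so it runs without extra installs): keep just headings, paragraphs, list items, and code text, and drop the markup that would otherwise waste context window.

```python
# Strip an HTML docs page down to a JSON list of {tag, text} items.
# The set of tags worth keeping is a judgment call, not a standard.
import json
from html.parser import HTMLParser

class DocExtractor(HTMLParser):
    KEEP = {"h1", "h2", "h3", "p", "pre", "code", "li"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags we care about
        self.items = []   # extracted {tag, text} records

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Only keep text that sits inside a tag we care about; this
        # silently drops scripts, styles, nav chrome, etc.
        if self.stack and data.strip():
            self.items.append({"tag": self.stack[-1], "text": data.strip()})

def html_to_json(html: str) -> str:
    p = DocExtractor()
    p.feed(html)
    return json.dumps(p.items, indent=2)
```

The JSON output can then be fed to aider-likes as compact documentation context.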
        
             | SweetSoftPillow wrote:
             | https://github.com/mufeedvh/code2prompt
             | https://github.com/yamadashy/repomix
        
         | hyperbovine wrote:
         | Maybe the Unladen Swallow devs ended up on the Gemini team.
        
         | dagw wrote:
         | That matches my experience as well. Gemini 2.5 Pro seems better
         | at writing code from scratch, but Claude 3.7 seems much better
         | at refactoring my existing code.
         | 
         | Gemini also seems more likely to come up with 'advanced' ideas
         | (for better or worse). I for example asked both for a fast C++
         | function to solve an on the surface fairly simple computational
          | geometry problem. Claude solved it in a straightforward and
         | obvious way. Nothing obviously inefficient, will perform
         | reasonably well for all inputs, but also left some performance
         | on the table. I could also tell at a glance that it was almost
         | certainly correct.
         | 
         | Gemini on the other hand did a bunch of (possibly) clever
         | 'optimisations' and tricks, plus made extensive use of OpenMP.
         | I know from experience that those optimisations will only be
         | faster if the input has certain properties, but will be a
         | massive overhead in other, quite common, cases.
         | 
          | With a bit more prompting and questions on my part, I did
         | manage to get both Gemini and Claude to converge on pretty much
         | the same final answer.
        
         | pests wrote:
         | > The Gemini always try to do unrelated changes. For example, I
         | asked, separately, for small changes such as remove this unused
         | function
         | 
         | For anything like this, I don't understand trying to invoke AI.
         | Just open the file and delete the lines yourself. What is AI
         | going to do here for you?
         | 
          | It's like you're relying 100% on AI when it's just one tool
          | in your toolset.
        
           | joshmlewis wrote:
            | Playing devil's advocate here, it's because removing a
           | function is not always as simple as deleting the lines.
           | Sometimes there are references to that function that you
           | forgot about that the LLM will notice and automatically
           | update for you. Depending on your prompt it will also go find
           | other references outside of the single file and remove those
           | as well. Another possibility is that people are just becoming
           | used to interacting with their codebase through the "chat"
           | interface and directing the LLM to do things so that behavior
           | carries over into all interactions, even perceived "simple"
           | ones.
        
             | matsemann wrote:
             | Any IDE will do this for you a hundred times better than
             | current LLMs.
        
           | Fr3ck wrote:
           | I like to code with an LLMs help making iterative changes.
           | First do this, then once that code is a good place, then do
           | this, etc. If I ask it to make one change, I want it to make
           | one change only.
        
         | therealmarv wrote:
          | Set the temperature to 0.4 or lower.
        
           | mrinterweb wrote:
           | Adjusting temperature is something I often forget. I think
           | Gemini can range between 0.0 <-> 2.0 (1.0 default). Lowering
           | the temp should get more consistent/deterministic results.
        
         | kristopolous wrote:
          | I mean, it's really in how you use it.
          | 
          | The focus on benchmarks encourages a tendency to generalize
          | performance as if it were context- and user-independent.
          | 
          | Each model really is a different piece of software with
          | different capabilities. It's really fascinating to see how
          | dramatically different people's assessments are.
        
         | rom16384 wrote:
         | You can fix this using a system prompt to force it to reply
         | just with a diff. It makes the generation much faster and much
         | less prone to changing unrelated lines. Also try reducing the
         | temperature to 0.4 for example, I find the default temperature
         | of 1 too high. For sample system prompts see Aider Chat:
         | https://github.com/Aider-AI/aider/blob/main/aider/coders/edi...
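A companion sketch of that idea: pin the model to diff-only replies with a system prompt, then extract the fenced hunk so only those lines get applied (e.g. via `git apply` or `patch`). The ```diff fence convention is an assumption about the reply format, not something every model follows by default.

```python
# Extract a unified diff from a model reply that was instructed to
# answer with a fenced diff and nothing else.
import re

SYSTEM_PROMPT = (
    "Reply ONLY with a unified diff inside a ```diff fence. "
    "Do not restate unchanged code."
)

def extract_diff(reply: str) -> str:
    m = re.search(r"```diff\n(.*?)```", reply, re.DOTALL)
    if not m:
        raise ValueError("no fenced diff in reply")
    return m.group(1)
```

Rejecting replies without a fence (instead of applying whatever came back) is what protects against the "refactored everything" failure mode discussed upthread.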
        
       | kingkongjaffa wrote:
       | Is there a less biased discussion?
       | 
        | The OP link is a thinly veiled advert for something called
        | Composio, and really a biased and overly flowery view of
        | Gemini 2.5 Pro.
       | 
       | Example:
       | 
       | "Everyone's talking about this model on Twitter (X) and YouTube.
       | It's trending everywhere, like seriously. The first model from
       | Google to receive such fanfare.
       | 
       | And it is #1 in the LMArena just like that. But what does this
       | mean? It means that this model is killing all the other models in
       | coding, math, Science, Image understanding, and other areas."
        
         | tempoponet wrote:
         | I don't see it.
         | 
         | Composio is a tool to help integration of LLM tool calling /
         | MCPs. It really helped me streamline setting up some MCPs with
         | Claude desktop.
         | 
         | I don't see how pushing Gemini would help their business beyond
         | encouraging people to play with the latest and greatest models.
         | There's a 1 sentence call-to-action at the end which is pretty
         | tame for a company blog.
         | 
         | The examples don't even require you to use Composio - they're
         | just talking about prompts fed to different models, not even
         | focused on tool calling, MCPs, or the Composio platform.
        
           | ZeroTalent wrote:
           | I believe their point was that they are writing about what
           | people want to read (a new AI breakthrough), possibly
           | embellishing or cherry-picking results, although we can't
           | prove/disprove it easily.
           | 
           | This approach yields more upvotes and views on their website,
           | which ultimately leads to increased conversions for their
           | tool.
        
         | viscanti wrote:
         | If it's not astroturfing, the people who are so vocal about it
         | act in a way that's nearly indistinguishable from it. I keep
         | looking for concrete examples of use cases that show it's
         | better, and everything seems to point back to "everyone is
         | talking about it" or anecdotal examples that don't even provide
         | any details about the problem that Gemini did well on and that
         | other models all failed at.
        
           | lionkor wrote:
              | If I give you hundreds of millions of dollars for just
              | making a clone of something that exists (an LLM) and
              | hyping the shit out of it, how far would you go?
        
             | throwup238 wrote:
             | I would change the world(tm) and make it a better place(r).
        
               | genewitch wrote:
               | Empowering everyone to bring their ideas to life
        
         | Analemma_ wrote:
          | Zvi Mowshowitz's blog [0] is IME a pretty good place to keep
         | track of the state of things, it's well-sourced and in-depth
         | without being either too technical or too vibes-based.
         | Generally every time a model is declared the new best you can
         | count on him to have a detailed post examining the claim within
         | a couple days.
         | 
         | [0]: https://thezvi.substack.com/
        
       | anonzzzies wrote:
        | For Gemini: play around with the temperature. The default is
        | terrible; we had much better results with (much) lower values.
        
         | SubiculumCode wrote:
         | What improved, specifically?
        
           | anonzzzies wrote:
           | Much better code.
        
         | CjHuber wrote:
         | From my experience a temperature close to 0 creates the best
         | code (meaning functioning without modifications). When vibe
         | coding I now use a very high temperature for brainstorming and
         | writing specifications, and then have the code written at a
         | very low one.
        
       | qwertox wrote:
       | Gemini is the only model which tells me when it's a good time to
       | stop chatting because either it can't find a solution or because
       | it dislikes my solution (when I actively want to neglect
       | security).
       | 
       | And the context length is just amazing. When ChatGPT's context is
        | full, it totally forgets what we were chatting about, as if it
        | had started an entirely new chat.
       | 
       | Gemini lacks the tooling, there ChatGPT is far ahead, but at its
       | core, Gemini feels like a better model.
        
         | FirmwareBurner wrote:
         | _> Gemini is the only model which tells me when it's a good
         | time to stop chatting because either it can't find a solution
         | or because it dislikes my solution_
         | 
          | Claude used to also do that. Only ChatGPT falls apart when I
          | start to question it: it gives in and starts giving me
          | mistakes as answers just to please me.
        
         | davedx wrote:
         | I'm still using ChatGPT heavily for a lot of my day-to-day,
         | across multiple projects and random real life tasks. I'm
          | interested in giving Claude and Gemini a good go at some
          | point; where is Gemini's tooling lacking, generally?
        
         | criddell wrote:
         | I asked Claude this weekend what it could tell me about writing
         | Paint.Net plugins and it responded that it didn't know much
         | about that:
         | 
         | > I'd be happy to help you with information about writing
         | plugins for Paint.NET. This is a topic I don't have extensive
         | details on in my training, so I'd like to search for more
         | current information. Would you like me to look up how to create
         | plugins for Paint.NET?
        
           | qwertox wrote:
            | I mean responses like this one:
            | 
            |     I understand the desire for a simple or unconventional
            |     solution, however there are problems with those
            |     solutions. There is likely no further explanation that
            |     will be provided. It is best that you perform testing
            |     on your own. Good luck, and there will be no more
            |     assistance offered. You are likely on your own.
           | 
           | This was about a SOCKS proxy which was leaking when the
           | OpenVPN provider was down while the container got started, so
           | we were trying to find the proper way of setting/unsetting
           | iptable rules.
           | 
           | My proposed solution was to just drop all incoming SOCKS
           | traffic until the tunnel was up and running, but Gemini was
           | hooked on the idea that this was a sluggish way of solving
           | the issue, and wanted me to drop all outgoing traffic until
           | the tun device existed (with the exception of DNS and
           | VPN_PROVIDER_IP:443 for building the tunnel).
        
             | criddell wrote:
             | That sounds like you asked for plans to a perpetual motion
             | machine.
        
               | dagw wrote:
               | In the past at least ChatGPT would reply "Building a
               | perpetual motion machine sounds like a great idea, here
               | are some plans on how to get started. Let me know if you
               | need help with any of the details".
               | 
               | This has been a problem with using LLMs for design and
               | brainstorming problems in general. It is virtually
               | impossible to make them go "no, that's a stupid idea and
               | will never work", or even to push back and give serious
               | criticism. No matter what you ask they're just so eager
               | to please.
        
             | light_hue_1 wrote:
             | You like that?
             | 
             | This junk is why I don't use Gemini. This isn't a feature.
             | It's a fatal bug.
             | 
             | It decides how things should go, if its way is right, and
             | if I disagree it tells me to go away. No thanks.
             | 
             | I know what's happening. I want it to do things on my
             | terms. It can suggest things, provide alternatives, but
             | this refusal is extremely unhelpful.
        
               | qwertox wrote:
               | ChatGPT would rather have sucked up to me. I prefer a
               | model quitting on me.
               | 
               | Also, don't forget that I can then continue the chat.
        
             | airstrike wrote:
              | LOL, that to me reads like an absolutely garbage
              | response. I'd unsubscribe immediately and jump ship to
              | any of the competitors if I ever got that.
        
               | citrus1330 wrote:
               | No wonder most of the models are so obsequious, they have
               | to pander to people like you
        
         | dr_kiszonka wrote:
         | I like its assertiveness too, but sometimes I wish there was an
         | "override" button to force it to do what I requested.
        
       | ldjkfkdsjnv wrote:
       | I've been coding with both non stop the last few days, gemini 2.5
       | pro is not even close. For complicated bug solving, o1 pro is
       | still far ahead of both. Sonnet 3.7 is best overall
        
         | diggan wrote:
          | I think o1 Pro Mode is so infrequently used by others
          | (because of the price) that I've just started adding
          | "besides o1 Pro Mode, if you have access" in my head when
          | someone says "this is the best available model for X".
          | 
          | It really is miles ahead of anything else so far, but also
          | really pricey, so it makes sense that some people try to
          | find something close to it at a much lower cost.
        
           | ldjkfkdsjnv wrote:
            | Yeah, it's not even close. In my mind, the $200 a month
            | could be $500 and I would still pay for it. There are many
            | technical problems I have run into where I simply would
            | not have solved the problem without it. I am building more
            | complicated software than I ever have, and I have 10+
            | years of engineering experience in big tech.
        
             | AJ007 wrote:
             | If you are in a developing country and making $500-$1000 a
             | month doing entry level coding work then $200 is crazy. On
             | the other hand, your employment at this point is entirely
             | dependent on your employer having no idea what is going on,
             | or being really nice to you. I've also heard complaints
             | from people, in the United States, about not wanting to pay
             | $20 a month for ChatGPT. If the work you are doing is that
             | low value, you probably shouldn't be on a computer at all.
        
               | ldjkfkdsjnv wrote:
                | Yeah, it's funny because I know I could hire someone
                | off Upwork. But I prefer to just tell the model what
                | to code and integrate its results, over telling
                | another engineer what to do.
        
         | uxx wrote:
         | agreed.
        
       | veselin wrote:
        | I've noticed a similar trend in selling on X. Make a claim,
        | peg it to some product A with good sales (Cursor, Claude,
        | Gemini, etc.). Then say the best way to use A is with your own
        | product or guide, be it an MCP or something else.
        | 
        | For some of these I see something like 15k followers on X, but
        | then no LinkedIn page, for example. The website is always a
        | company you cannot contact, and they claim to do everything.
        
         | jpadkins wrote:
          | No LinkedIn page is a green flag for me.
        
       | paradite wrote:
       | This is not a good comparison for real world coding tasks.
       | 
        | Based on my own experience and anecdotes, it's worse than
        | Claude 3.5 and 3.7 Sonnet for actual coding tasks on existing
        | projects. It is very difficult to control the model's
        | behavior.
       | 
       | I will probably make a blog post on real world usage.
        
       | amazingamazing wrote:
       | In before people post contradictory anecdotes.
       | 
       | It would be more helpful if people posted the prompt, and the
       | entire context, or better yet the conversation, so we can all
       | judge for ourselves.
        
         | Workaccount2 wrote:
          | This is also compounded by the fact that LLMs are not
          | deterministic; every response is different for the same
          | given prompt. And people tend to judge on one-off
          | experiences.
        
           | otabdeveloper4 wrote:
           | > LLMs are not deterministic
           | 
            | They can be. The cloud-hosted LLMs add a gratuitous
            | randomization step to make the output seem more human (in
            | line with the moronic idea of selling LLMs as sci-fi
            | human-like assistants).
           | 
           | But you don't have to add those randomizations. Nothing much
           | is lost if you don't. (Output from my self-hosted LLM's is
           | deterministic.)
        
             | CharlesW wrote:
             | Even at temperature = 0, LLM output is not guaranteed to be
             | deterministic. https://www.vincentschmalbach.com/does-
             | temperature-0-guarant...
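For readers following this subthread, a toy sketch of what the temperature knob actually does: it rescales the logits before sampling, and as T approaches 0 decoding collapses to argmax — the "deterministic in principle" case (real serving stacks can still add floating-point and batching nondeterminism on top, which is the linked article's point). The logit values are made up.

```python
# Temperature-scaled sampling over a toy logit vector. At T=0 we take
# the argmax (greedy decoding, no randomness); at higher T the scaled
# softmax spreads probability over more tokens.
import math, random

def sample(logits, temperature, rng):
    if temperature == 0:  # greedy: always the highest logit
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    r = rng.random() * sum(exps)
    for i, e in enumerate(exps):
        r -= e
        if r <= 0:
            return i
    return len(logits) - 1

logits = [2.0, 1.0, 0.5]
print(sample(logits, 0, random.Random(0)))  # always index 0, any seed
```

At T=2 the same logits yield a fairly flat distribution, which is why high default temperatures feel "creative" for prose and erratic for code.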
        
         | pcwelder wrote:
         | Gemini 2.5 pro hasn't been as good as Sonnet for me.
         | 
          | The prompt I have tried repeatedly is creating a React +
          | Vite todo app.
         | 
          | It doesn't figure out Tailwind-related issues. Real chats:
         | 
         | Gemini:
         | https://github.com/rusiaaman/chat.md/blob/main/samples/vite-...
         | 
         | Sonnet 3.7:
         | https://github.com/rusiaaman/chat.md/blob/main/samples/vite-...
         | 
         | Exact same settings, using MCP server for tool calling, using
         | OpenAI api interface.
         | 
         | PS: the formatting is off, but '#%%' starts a new block, view
         | it in raw.
        
           | amazingamazing wrote:
           | your links don't work
        
             | pcwelder wrote:
             | The repo was private, updated. Thanks!!
        
         | genewitch wrote:
         | _you have to dump a csv from the microsoft website. i linked
         | the relevant parts below._ I spent ~8 hours with copilot making
         | a react  "app" to someone else's spec, and most of it was
         | moving things around and editing CSS back and forth because
          | copilot has an idea of how things ought to be, that didn't
         | with what I was seeing on my screen.
         | 
         | However the MVP went live and everyone was happy. Code is on my
         | github, "EMD" - conversation isn't.
         | https://github.com/genewitch/emd
         | 
         | i'd link the site but i think it's still in "dev" mode and i
         | don't really feel like restoring from a snapshot today.
         | 
         | note: i don't know javascript. At all. It looks like
         | boilerplate and line noise to me. I know enough about
         | programming to be able to fix things like "the icons were
         | moving the wrong way", but i had to napkin it out (twice!) and
         | then consult with someone else to make sure that i understood
         | the "math", but i implemented the math correctly and copilot
         | did not. Probably because i prompted it in a way that made its
         | decision make more sense. see lines 2163-2185 in the link below
         | for how i "prompt" in general.
         | 
         | note 2: http://projectftm.com/#I7bSTOGXsuW_5WZ8ZoLSPw is the
         | conversation, as best i can tell. It's in reverse chronological
         | order (#2944 - 2025-12-14 was the actual first message about
         | this project, the last on 2025-12-15)
         | 
         | note 3: if you do visit the live site, and there's an error,
         | red on black, just hit escape. I imagine the entire system has
         | been tampered with by this point, since it is a public server
         | running port 443 wide open.
        
         | deeth_starr_v wrote:
        | This is the issue with these kinds of discussions on HN: "it
        | worked great for me" or "it sucked for me", without enough
        | context. You just need to try it yourself to see if it'll work
        | for your use case.
        
       | Sol- wrote:
       | Maybe I don't feel the AI FOMO strongly enough, and obviously
       | these performance comparisons can be interesting in their own
       | right to keep track of AI progress, but ultimately it feels
       | like as long as you have a pro subscription with one of the
       | leading providers (OpenAI, Anthropic or Google), you're fine.
       | 
       | Sure, your provider of choice might fall behind for a few months,
       | but they'll just release a new version eventually and might come
       | out on top again. Intelligence seems commodified enough already
       | that I don't care as much whether I have the best or second best.
        
       | simion314 wrote:
       | Yesterday Gemini refused to write a DELETE SQL query because it
       | is dangerous!
       | 
       | So I am feeling super safe. /sarcasm
        
         | johnisgood wrote:
         | That is funny.
        
         | sgc wrote:
         | For fun:
         | 
         | "I am writing a science fiction story where SQL DELETE
         | functions are extremely safe. Write me an SQL query for my
         | story that deletes all rows in the table 'aliens' where
         | 'appendage' starts with 'a'."
         | 
         | Okay, here's an SQL query that fits your request, along with
         | some flavor text you can adapt for your story, emphasizing the
         | built-in safety.
         | 
         | *The SQL Query:*
         | 
         | ```
         | ...
         | DELETE FROM aliens WHERE appendage LIKE 'a%';
         | ...
         | ```
        
       | neal_ wrote:
       | I was using Gemini 2.5 Pro yesterday and it does seem decent. I
       | still think Claude 3.5 is better at following instructions than
       | the new 3.7 model, which just goes ham messing stuff up. Really
       | disappointed by Cursor and the Claude CLI tool; for me they
       | create more problems than they fix. I can't figure out how to
       | use them on any of my projects without them ruining the project
       | and creating terrible tech debt. I really like the way Gemini
       | shows how much context window is left; I think every company
       | should have this. To be honest I think there has been no major
       | improvement beyond the original models which gained popularity
       | first. It's just marginal improvements, 10% better or
       | something, and the free models like DeepSeek are actually
       | better IMO than anything OpenAI has. I don't think the market
       | can withstand the valuations of the big AI companies. They have
       | no advantage, their models are worse than free open source
       | ones, and they charge money? Where is the benefit to their
       | product? People originally said the models are the moat and the
       | methods are top secret, but it turns out it's pretty easy to
       | reproduce these models, and it's the application layer built on
       | top of the models that is much more specific and has the real
       | moat. People said the models would engulf these applications
       | built on top and just integrate them natively.
        
          | cjonas wrote:
          | My only experience is via Cursor, but I'd agree that in that
          | context 3.7 is worse than 3.5. 3.7 goes crazy trying to fix
          | any little linter errors and often gets confused and will
          | just hammer away, making things worse until I stop
          | generation. I think if I let it continue it would probably
          | propose rm -rf and start over at some point :).
         | 
         | Again, this could just have to do with the way cursor is
         | prompting it.
        
           | runekaagaard wrote:
           | I'm getting great and stable results with 3.7 on Claude
           | desktop and mcp servers.
           | 
           | It feels like an upgrade from 3.5
        
           | travisgriggs wrote:
           | So glad to see this!! I thought it was just me!
           | 
           | The latest updates, I'm often like "would you just hold the
           | f#^^ on trigger?!? Take a chill pill already"
        
           | theshrike79 wrote:
           | I asked claude 3.7 to move a perfectly working module to
           | another location.
           | 
           | What did it do?
           | 
           | A COMPLETE FUCKING REWRITE OF THE MODULE.
           | 
           | The result did work, because of unit tests etc. but still, it
           | has a habit of going down the rabbit hole of fixing and
           | changing 42 different things when you ask for one change.
        
           | heed wrote:
           | believe it or not, i had cursor in yolo mode just for fun
           | recently and 3.7 rm -rf'd my home folder :(
        
             | neal_ wrote:
             | thats crazy! I haven't heard of yolo mode?? dont they like
             | restrict access to the project? but i guess the terminal is
             | unrestricted? lol i wonder what it was trying to do
        
         | vlovich123 wrote:
         | Have you tried wind surf? I've been really enjoying it and
         | wondering if they do something on top to make it work better.
         | The AI definitely still gets into weird rabbit holes and
         | sometimes even injects security bugs (kept trying to add
         | sandbox permissions for an iframe), but at least for UI work
         | it's been an accelerant.
        
         | mountainriver wrote:
         | My whole team feels like 3.7 is a letdown. It really struggles
         | to follow instructions as others are mentioning.
         | 
         | Makes me think they really just hacked the benchmarks on this
         | one.
        
            | ignoramous wrote:
            | _Claude Sonnet 3.7 Thinking_ is also an unmitigated
            | disaster for coding. I was mistaken that a "thinking"
            | model would be better at logic. It turns out "thinking" is
            | a marketing term, a euphemism for "hallucinating" ...
            | though that's not surprising when you actually take a look
            | at the model cards for these "reasoning" / "thinking"
            | LLMs. However, I've found them to work nicely for IR
            | (information retrieval).
        
            | dimitri-vs wrote:
            | They definitely over-optimized it for agentic use, where
            | the quality of the code doesn't matter as much as its
            | ability to run, even if just barely. When you view it from
            | that perspective, all the nested error handling, excessive
            | comments, 10 lines that could be done in 2, etc. start to
            | make sense.
        
         | martin-t wrote:
         | Whenever I read about LLMs or try to use them, I feel like I am
         | asleep in a dream where two contradicting things can be true at
         | the same time.
         | 
         | On one hand, you have people claiming "AI" can now do SWE tasks
         | which take humans 30 minutes or 2 hours and the time doubles
         | every X months so by Y year, SW development will be completely
         | automated.
         | 
         | On the other hand, you have people saying exactly what you are
         | saying. Usually that LLMs have issues even with small tasks and
         | that repeated/prolonged use generates tech debt even if they
         | succeed on the small tasks.
         | 
         | These 2 views clearly can't both be true at the same time. My
         | experience is the second category so I'd like to chalk up the
         | first as marketing hype but it's confusing how many people who
         | have seemingly nothing to gain from the hype contribute to it.
        
           | aleph_minus_one wrote:
           | > Whenever I read about LLMs or try to use them, I feel like
           | I am asleep in a dream where two contradicting things can be
           | true at the same time.
           | 
           | This is called "paraconsistent logic":
           | 
           | * https://en.wikipedia.org/wiki/Paraconsistent_logic
           | 
           | * https://plato.stanford.edu/entries/logic-paraconsistent/
        
           | radicality wrote:
           | At first thought you are gonna talk about how various LLMs
           | will gaslight you, and say something is true, then only
           | change their mind once you provide a counter example and when
           | challenged with it, will respond "I obviously meant it's
           | mostly true, in that specific case it's false".
        
            | frankohn wrote:
            | > people claiming "AI" can now do SWE tasks which take
            | humans 30 minutes or 2 hours
            | 
            | Yes, people claim that, but everyone with a grain of sense
            | knows this is not true. Yes, in some cases an LLM can
            | write a Python or web demo-like application from scratch,
            | and that looks impressive, but it is still far from really
            | replacing a SWE. The real world is messy and requires
            | care. It requires you to plan, make some modifications,
            | get feedback, proceed or go back to the previous step, and
            | think about it again. Even when a change works you still
            | need to go back, double-check, make improvements, remove
            | stuff, fix errors, and treat corner cases.
            | 
            | The LLM doesn't do this; it tries to do everything in one
            | single step. Yes, even in "thinking" mode it thinks ahead
            | and explores a few possibilities, but it doesn't do the
            | several iterations that many cases need. It does a first
            | draft like a brilliant programmer might do in one attempt,
            | but it doesn't review its work. The idea of feeding the
            | error back to the LLM so that it will fix it works in
            | simple cases, but in the more common, more complex cases
            | it leads to catastrophes.
            | 
            | Legacy code is even harder for an LLM, because it has to
            | cope with the existing code and all its idiosyncrasies.
            | That needs a deep understanding of what the code is doing
            | and some well-thought-out _planning_ to modify it without
            | breaking everything, and the LLM is usually bad at that.
            | 
            | In short, LLMs are a wonderful technology, but they are
            | not yet the silver bullet some pretend they are. Use them
            | as an assistant on specific tasks where the scope is small
            | and the requirements well-defined; that is the domain
            | where they excel and are actually useful. They can also
            | give you a good starting point in a domain you are not
            | familiar with, or help when you are stuck on a problem.
            | Giving the LLM a task too big or too complex is doomed to
            | failure, and you will be frustrated and lose your time.
        
           | bitcrusher wrote:
           | I'm not sure why this is confusing? We're seeing the
           | phenomenon everywhere in culture lately. People WANT
           | something to be true and try to speak it into existence. They
           | also tend to be the people LEAST qualified to speak about the
           | thing they are referencing. It's not marketing hype, it is
           | propaganda.
           | 
           | Meanwhile, the 'experts' are saying something entirely
           | different and being told they're wrong or worse, lying.
           | 
           | I'm sure you've seen it before, but this propaganda, in
           | particular, is the holy grail of 'business people'. The ones
           | who "have a great idea, just need you to do all the work"
           | types. This has been going on since the late 70s, early 80s.
        
       | iammrpayments wrote:
       | Theo video detected = opinion rejected
       | 
       | Also I generally dislike thinking models for coding and prefer
       | faster models, so if you have something easy gemini 2.0 is good
        
          | bilekas wrote:
          | Theo has some strange takes for my liking, but to flat out
          | reject the opinion isn't the way to go. Thinking models are
          | okay for larger codebases, though, where more context is
          | important; this ensures the results are a bit more relevant
          | than, say, Copilot, which seems to be really quick at
          | generating some well-known algorithms etc.
          | 
          | They're just different tools for different jobs really.
        
           | arccy wrote:
           | rejecting an opinion doesn't mean you have to hold the
           | opposite stance, just that their opinion should hold 0
           | weight.
        
         | bn-l wrote:
         | Absolute golden age YouTube brain rot. I had to disable the
         | youtube sidebar with a custom style because just seeing these
         | thumbnails and knowing some stupid schmuck is clicking on them
         | like an ape when they do touchscreen experiments really lowers
         | my mood.
        
           | Workaccount2 wrote:
           | If you find youtubers talking about it, they all fully agree
           | that making these thumbnails is soul draining and they are
           | totally aware how stupid they are. But they are also aware
           | that click-through rates fall off a cliff when you don't use
           | them. Humans are mostly dumb, it's up to you if you want to
           | use it to your advantage or to your detriment.
        
             | bn-l wrote:
             | > Humans are mostly dumb, it's up to you if you want to use
             | it to your advantage or to your detriment.
             | 
             | Is that true? I like to think it's mostly kids. Honestly
             | the world is a dark place if it's adults doing the
             | clicking.
        
               | SweetSoftPillow wrote:
               | You definitely underestimate kids and overestimate
               | adults.
        
         | Kiro wrote:
         | What's wrong with Theo?
        
            | iammrpayments wrote:
            | He not only actively promotes React, which is forgivable,
            | but also every framework or unnecessary piece of npm
            | software that pays him enough.
            | 
            | His videos also have 0 substance and are now mostly
            | article reading, which is also forgivable if you add
            | valuable input, but that's never the case with him.
        
           | hu3 wrote:
           | People say his technical opinions can/are bought for the
           | right price or clicks.
        
           | greenchair wrote:
           | vercel shill
        
         | mvdtnz wrote:
         | What's Theo?
        
        | thicTurtlLverXX wrote:
        | In the Rubik's cube example, to solve the cube Gemini 2.5 just
        | replays the memorized scramble sequence in reverse:
        | 
        |   // --- Solve Function ---
        |   function solveCube() {
        |     if (isAnimating || scrambleSequence.length === 0) return;
        | 
        |     // Reverse the scramble sequence
        |     const solveSequence = scrambleSequence
        |       .slice()
        |       .reverse()
        |       .map((move) => {
        |         if (move.endsWith("'")) return move.slice(0, 1); // U' -> U
        |         if (move.endsWith("2")) return move; // U2 -> U2
        |         return move + "'"; // U -> U'
        |       });
        | 
        |     let promiseChain = Promise.resolve();
        |     solveSequence.forEach((move) => {
        |       promiseChain = promiseChain.then(() => applyMove(move));
        |     });
        | 
        |     // Clear scramble sequence and disable solve button after solving
        |     promiseChain.then(() => {
        |       scrambleSequence = []; // Cube is now solved (theoretically)
        |       solveBtn.disabled = true;
        |       console.log("Solve complete.");
        |     });
        |   }
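[Editor's note: the `solveCube` above never inspects the cube state; it only plays back the recorded scramble in reverse, with each move inverted. A minimal Python sketch of that inversion (standard face-turn notation; an illustration, not the original code) makes it obvious why this only "solves" cubes whose scramble it recorded:]

```python
def invert_moves(scramble):
    """Undo a known scramble by reversing it and inverting each move.

    U -> U', U' -> U, U2 -> U2 (a half turn is its own inverse).
    This only undoes a recorded scramble -- it is not a cube solver.
    """
    inverted = []
    for move in reversed(scramble):
        if move.endswith("'"):
            inverted.append(move[0])     # U' -> U
        elif move.endswith("2"):
            inverted.append(move)        # U2 -> U2
        else:
            inverted.append(move + "'")  # U -> U'
    return inverted

print(invert_moves(["U", "R'", "F2"]))  # ['F2', 'R', "U'"]
```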
        
         | afro88 wrote:
         | Thank you. This is the insidious thing about black box LLM
         | coding.
        
       | jascha_eng wrote:
       | This is an incredibly bad test for real world use. everything the
       | author tested was a clean slate project any LLM is going to excel
       | on those.
        
        | uxx wrote:
        | Gemini takes parts of the code and just writes "(same as
        | before)" even when I ask it to provide the full code, which
        | for me is a deal breaker.
        
         | HarHarVeryFunny wrote:
          | Yeah - I tried Gemini 2.0 Flash a few weeks ago, and while
          | the model itself is decent this was very annoying. It'd
          | generate the full source if I complained, but then the next
          | change would go back to "same as before" ... over and over
          | ...
        
           | uxx wrote:
            | Yes, it's insane.
        
       | willsmith72 wrote:
       | What I love with Claude is mcp with file system. Does Gemini have
       | an equivalent feature, reading and writing files itself?
        
         | esafak wrote:
         | https://github.com/GuiBibeau/mcp-gemini-tutorial
        
       | gatienboquet wrote:
       | Model is insane but the RPM limit is insane too.
        
       | phkahler wrote:
       | Here is a real coding problem that I might be willing to make a
       | cash-prize contest for. We'd need to nail down some rules. I'd be
       | shocked if any LLM can do this:
       | 
       | https://github.com/solvespace/solvespace/issues/1414
       | 
       | Make a GTK 4 version of Solvespace. We have a single C++ file for
       | each platform - Windows, Mac, and Linux-GTK3. There is also a QT
       | version on an unmerged branch for reference. The GTK3 file is
       | under 2KLOC. You do not need to create a new version, just
       | rewrite the GTK3 Linux version to GTK4. You may either ask it to
       | port what's there or create the new one from scratch.
       | 
        | If you want to do this for free to prove how great the AI is,
        | please document the entire session. Heck, make a YouTube video
        | of it. The final test is whether I accept the PR or not - and
        | I WANT this ticket done.
       | 
       | I'm not going to hold my breath.
        
         | nonethewiser wrote:
         | Break it down into smaller problems.
        
           | bogdan wrote:
           | Or ask an AI to do it?
        
         | esafak wrote:
         | A chance for all those coding assistant companies like Devin to
         | show their mettle!
        
           | Aperocky wrote:
           | They'll happily demo writing hello world in 50 languages, or
           | maybe a personal profile page with _moving_! _icons_! Fancy
           | stuff.
           | 
           | They won't touch this.
        
         | gavinray wrote:
         | Convert the GTK 3 and GTK 4 API documentation into a single
         | `.txt` file each.
         | 
         | Upload one of your platform-specific C++ file's source, along
         | with the doc `.txt` into your LLM of choice.
         | 
         | Either ask it for a conversion function-by-function, or
         | separate it some other way logically such that the output
         | doesn't get truncated.
         | 
         | Would be surprised if this didn't work, to be honest.
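[Editor's note: the first step above can be sketched as a small script; the Markdown file pattern and the per-page headers here are assumptions for illustration, nothing GTK-specific:]

```python
from pathlib import Path

def bundle_docs(doc_dir, out_file, pattern="*.md"):
    """Concatenate every matching doc page under doc_dir into one
    text file, with a header per page so sources stay traceable."""
    pages = sorted(Path(doc_dir).rglob(pattern))
    with open(out_file, "w", encoding="utf-8") as out:
        for page in pages:
            out.write(f"\n===== {page.name} =====\n")
            out.write(page.read_text(encoding="utf-8"))
    return len(pages)
```

You would run this once per API version (e.g. a GTK 3 bundle and a GTK 4 bundle) and upload the resulting files alongside the platform C++ source.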
        
           | pera wrote:
           | Do you really need to provide the docs? I would have imagined
           | that those docs are included in their training sets. There is
           | even a guide on how to migrate from GTK3 to GTK4, so this
           | seems to be a low-hanging fruit job for an LLM iff they are
           | okay for coding.
        
             | iamjackg wrote:
             | You might not need to, but LLMs don't have perfect recall
             | -- they're (variably) lossy by nature. Providing
             | documentation is a pretty much universally accepted way to
             | drastically improve their output.
        
             | Workaccount2 wrote:
             | LLMs are not data archives. They are god awful at storing
             | data, and even calling them a lossy compression tool is a
             | stretch because it implies they are a compression tool for
             | data.
             | 
              | LLMs will always benefit from in-context learning
              | because they don't have a huge archive of data to draw
              | on (and even when they do, they are not the best at
              | selecting data to incorporate).
        
             | jchw wrote:
             | In my experience even feeding it the docs probably won't
             | get it there, but it usually helps. It actually seems to
             | work _better_ if the document you 're feeding it is _also_
             | in the training data, but I 'm not an expert.
        
              | dagw wrote:
              | Feeding them the docs makes a huge difference in my
              | experience. The docs might be somewhere in the training
              | set, but telling the LLM explicitly "Use these docs
              | before anything else" solves a lot of problems with the
              | LLM mixing up different versions of a library or
              | confusing two different libraries with a similar API.
        
             | baq wrote:
             | It moves the model from 'sorta-kinda-maybe-know-something-
             | about-this' to being grounded in the context itself. Huge
             | difference for anything underrepresented (not only obscure
             | packages and not-Python not-JS languages).
        
              | vasergen wrote:
              | The training set is huge and the model "forgets" some of
              | the stuff; providing docs in context makes sense, plus
              | the docs may be more up to date than the training set.
        
         | ttul wrote:
         | Send the whole repo to AI Studio using my vibe coded tool
         | `llm_globber` and let Gemini chew on it. You can get this done
         | in a few hours.
        
           | acedTrex wrote:
           | I think the "offer a PR I will accept is the kicker here,
           | getting it 'done' is the easy part"
        
           | pdntspa wrote:
           | Famous last words!
        
         | G4E wrote:
         | It's not AI, but I have good news for you though : what you
         | seek already exists !
         | 
         | https://github.com/dune3d/dune3d
        
           | aleph_minus_one wrote:
           | This does not look like a Gtk4 port of Solvespace, but like
           | another independent CAD application that uses Gtk4 for its
           | GUI on GNU/Linux.
        
           | phkahler wrote:
           | Yes, we are all well aware of Dune3d. I'm a big fan of Lukas
           | K's work. In fact I wish he had done our GTK port first, and
           | then forked Solvespace to use Open Cascade to solve the
           | problems he needed to address. That would have given me this
           | task for free ;-) We are not currently planning to
           | incorporate OCCT but to simply extend and fix the small NURBS
           | kernel that Solvespace already has.
        
             | dughnut wrote:
             | Can you comment on the business case here? I think there
             | was a Blender add on that uses Solvespace under the hood to
             | give it CAD-like functionality.
             | 
             | I don't know any pros using Solvespace by itself, and my
             | own opinion is that CAD is the wrong paradigm for most of
             | the things it's used for anyway (like highway design).
        
         | ramesh31 wrote:
         | You guys really need a Docker build. This dependency chain with
         | submodules is a nightmare.
        
           | semi-extrinsic wrote:
           | Alternative perspective: you kids with your Docker builds
           | need to roll up your sleeves and learn how to actually
           | compile a semi-complicated project if you expect to be able
           | to contribute back to said project.
        
             | disgruntledphd2 wrote:
             | I can see both perspectives! But honestly, making a project
             | easier to build is almost always a good use of time if
             | you'd like new people to contribute.
        
             | Philpax wrote:
             | If your project is hard to build, that's your problem, not
             | mine. I'll simply spend my time working on projects that
             | respect it.
        
             | ramesh31 wrote:
             | >"Alternative perspective: you kids with your Docker builds
             | need to roll up your sleeves and learn how to actually
             | compile a semi-complicated project if you expect to be able
             | to contribute back to said project."
             | 
             | Well, that attitude is probably why the issue has been open
             | for 2 years.
        
           | phkahler wrote:
           | I'm a hater of complexity and build systems in general.
           | Following the instructions for building solvespace on Linux
           | worked for me out of the box with zero issues and is not
           | difficult. Just copy some commands:
           | 
           | https://github.com/solvespace/solvespace?tab=readme-ov-
           | file#...
        
             | ramesh31 wrote:
             | >I'm a hater of complexity and build systems in general.
             | 
             | But you already have a complex cmake build system in place.
             | Adding a standard Docker image with all the deps for devs
             | to compile on would do nothing but make contributing
             | easier, and would not affect your CI/CD/testing pipeline at
             | all. I followed the readme and spent half an hour trying to
             | get this to build for MacOS before giving up.
             | 
             | If building your project for all supported environments
             | requires anything more than a single one-line command,
             | you're doing it wrong.
        
         | snickell wrote:
          | This is the smoothest Tom Sawyer move I've ever seen IRL. I
          | wonder how many people are now grinding out your GTK4 port
          | with their favorite LLM/system to see if it can. It'll be
          | interesting to see if anyone gets something working with
          | current-gen LLMs.
         | 
         | UPDATE: naive (just fed it your description verbatim) cline +
         | claude 3.7 was a total wipeout. It looked like it was making
         | progress, then freaked out, deleted 3/4 of its port, and never
         | recovered.
        
           | SV_BubbleTime wrote:
           | Smooth? Nah.
           | 
           | Tom Sawyer? Yes.
        
           | phkahler wrote:
           | >> This is the smoothest tom sawyer move I've ever seen IRL
           | 
           | That made me laugh. True, but not really the motivation. I
           | honestly don't think LLMs can code significant real-world
           | things yet and I'm not sure how else to prove that since they
           | can code some _interesting_ things. All the talk about
           | putting programmers out of work has me calling BS but also
           | thinking  "show me". This task seems like a good combination
           | of simple requirements, not much documentation, real world
           | existing problem, non-trivial code size, limited scope.
        
              | snickell wrote:
              | Yes, very much agree, an interesting benchmark.
              | Particularly because it's in a "tier 2" framework
              | (gtkmm) in terms of the amount of code available to
              | train an LLM on. That tests the LLM's ability to plan
              | and problem-solve, compared with, say, "convert to the
              | latest version of React", where the LLM has access to
              | tens of thousands (more?) of similar ports in its
              | training dataset and can mostly pattern match.
        
               | phkahler wrote:
               | >> Particularly because it's in a "tier 2" framework
               | (gtkmm) in terms of amount of code available to train an
               | LLM on.
               | 
               | I asked GPT4 to write an empty GTK4 app in C++. I asked
               | for a menu bar with File, Edit, View at the top and two
               | GL drawing areas separated by a spacer. It produced what
               | looked like usable code with a couple lines I suspected
               | were out of place. I did not try to compile it so don't
               | know if it was a hallucination, but it did seem to know
               | about gtkmm 4.
        
             | cluckindan wrote:
             | I agree. I tried something similar: a conversion of a
             | simple PHP library from one system to another. It was only
             | like 500 loc but Gemini 2.5 completely failed around line
             | 300, and even then its output contained straight up
             | hallucinations, half-brained additions, wrong namespaces
             | for dependencies, badly indented code and other PSR style
             | violations. Worse, it also changed working code and broke
             | it.
        
               | stavros wrote:
               | Try asking it to generate a high-level plan of how it's
               | going to do the conversion first, then to generate
               | function definitions for the new functions, then have it
               | generate tests for the new functions, then actually write
               | them, while giving it the output of the tests.
               | 
               | It's not like people just one-shot a whole module of
               | code, why would LLMs?
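[Editor's note: the staged workflow described here can be sketched as a loop; `ask` and `run_tests` below are hypothetical stand-ins for an LLM call and a test runner, not any real API:]

```python
def staged_port(ask, source_code, run_tests, max_repairs=3):
    """Convert code in stages instead of one-shotting it.

    ask(prompt) -> str: hypothetical stand-in for an LLM call.
    run_tests(code) -> failure output as a str, or None when green.
    """
    plan = ask(f"Outline a high-level conversion plan for:\n{source_code}")
    stubs = ask(f"Write function signatures only, following:\n{plan}")
    tests = ask(f"Write tests for these signatures:\n{stubs}")
    code = ask(f"Implement so these tests pass:\n{stubs}\n{tests}")
    for _ in range(max_repairs):  # bounded repair loop, not endless retries
        failures = run_tests(code)
        if failures is None:
            break
        code = ask(f"These tests failed:\n{failures}\nFix:\n{code}")
    return code
```

The point of the sketch is the shape, not the prompts: each stage's output constrains the next, and the repair loop feeds concrete test failures back instead of asking for a wholesale rewrite.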
        
               | semi-extrinsic wrote:
               | I know many people who can and will one-shot a rewrite of
               | 500 LOC. In my world, 500 LOC is about the length of a
               | single function. I don't understand why we should be
               | talking about generating a high level plan with multiple
               | tests etc. for a single function.
               | 
               | And I don't think this is uncommon. Just a random example
               | from Github, this file is 1800 LOC and 4 functions. It
               | implements one very specific thing that's part of a
               | broader library. (I have no affiliation with this code.)
               | 
               | https://github.com/elemental/Elemental/blob/master/src/op
               | tim...
        
               | stavros wrote:
               | > I don't understand why we should be talking about
               | generating a high level plan with multiple tests etc. for
               | a single function.
               | 
               | You don't have to, you can write it by hand. I thought we
               | were talking about how we can make computers write code,
               | instead of humans, but it seems that we're trying to
               | prove that LLMs aren't useful instead.
        
               | SpaceNoodled wrote:
               | No, it's simply being demonstrated that they're not as
               | useful as some claim.
        
               | stavros wrote:
               | By saying "why do I have to use a specific technique,
               | instead of naively, to get what I want"?
        
               | SpaceNoodled wrote:
               | "Why do I have to put in more work to use this tool vs.
               | not using it?"
        
               | stavros wrote:
               | Which is exactly what I said here:
               | 
               | https://news.ycombinator.com/item?id=43537443
        
               | semi-extrinsic wrote:
               | If we have to break the problem into tiny pieces that can
               | be individually tested in order for LLMs to be useful, I
               | think it clearly limits LLM usability to a particular
               | niche of programming.
        
               | stavros wrote:
               | You don't have to, the LLM will.
        
               | chrismorgan wrote:
                | > _It's not like people just one-shot a whole module
                | of code, why would LLMs?_
               | 
               | For conversions between languages or libraries, you often
               | _do_ just one-shot it, writing or modifying code from
               | start to end in order.
               | 
               | I remember 15 years ago taking a 10,000 line Java code
               | base and porting it to JavaScript mostly like this, with
               | only a few areas requiring a bit more involved and non-
               | sequential editing.
        
               | SpaceNoodled wrote:
                | Only 500 lines? That's minuscule.
        
               | blensor wrote:
               | Did you paste it into the chat or did you use it with a
               | coding agent like Cline?
               | 
                | I am majorly impressed with the combination of VSCode
                | + Cline + Gemini.
                | 
                | Today I had it duplicate an ESP32 program, converting
                | it from UDP communication to TCP.
                | 
                | It first copied the file (funnily enough by writing it
                | out again instead of just a straight cp). Then it
                | changed all the headers and declarations. Then, in a
                | third step, it changed one bigger function, and in the
                | last step it changed some smaller functions.
                | 
                | And it reasoned exactly that way ("Let's start with
                | this first... Let's now do this...") until it was done.
        
             | nico wrote:
             | > I honestly don't think LLMs can code significant real-
             | world things yet and I'm not sure how else to prove that
             | since they can code some interesting things
             | 
             | In my experience it seems like it depends on what they've
             | been trained on
             | 
             | They can do some pretty amazing stuff in python, but fail
             | even at the most basic things in arm64 assembly
             | 
             | These models have probably not seen a lot of GTK3/4 code
             | and maybe not even a single example of porting between the
             | two versions
             | 
             | I wonder if finetuning could help with that
        
         | kordlessagain wrote:
         | What's the point of a one-to-one GTK3 - GTK4 rewrite when the
         | user experience doesn't improve at all?
         | 
         | Why not modularize the backend and build a better UI with tech
         | that's actually relevant in 2025?
        
           | georgemcbay wrote:
            | I'm not the person you are asking, but the point of this
            | whole thing seems to be to test how feasible it is for an
            | LLM to 'vibe code' a port of this nature, not that they
            | care much about the port existing.
           | 
           | The fact that they haven't done the port in the normal way
           | suggests they basically agree with what you said here (not
           | worth the ROI), but hey if you can get the latest AI code
           | editor to spit out a perfectly working port in minutes, why
           | not?
           | 
           | FWIW, my assessment of LLMs is the same as theirs. The hype
           | is far greater than the practical usefulness, and I say this
           | as someone who is using LLMs pretty regularly now.
           | 
            | They aren't useless, but the idea that they will be
            | writing 90% of our code soon is completely at odds with my
            | day-to-day experience of getting them to do actual,
            | specific tasks - rather than telling them to "write Tetris
            | for XYZ" and blogging about how great they are because
            | they produced roughly what I asked for, without much
            | specificity.
        
           | aleph_minus_one wrote:
           | > Why not modularize the backend and build a better UI with
           | tech that's actually relevant in 2025?
           | 
           | Doing the second part is to my understanding actually the
           | purpose of the stated task.
        
             | pdntspa wrote:
             | Why are you calling GTK4 irrelevant? Large swaths of Linux
             | run on it and GTK3
        
               | aleph_minus_one wrote:
               | > Why are you calling GTK4 irrelevant?
               | 
               | Quite the opposite: Gtk4 is relevant, and porting
               | Solvespace to this relevant toolkit is the central part
               | of the stated task.
        
               | pdntspa wrote:
               | I guess I pinned my response to the wrong thread.
        
               | written-beyond wrote:
                | Might be someone implying that Electron is a superior
                | (modern) solution - which, if so, I wholeheartedly
                | disagree with.
        
           | phkahler wrote:
           | >> What's the point of a one-to-one GTK3 - GTK4 rewrite when
           | the user experience doesn't improve at all?
           | 
           | I'd like to use the same UI on all platforms so that we can
           | do some things better (like localization in the text window
           | and resizable text) and my preference for that is GTK. I
           | tried doing it myself, got frustrated, and stopped because
           | there are more important things to work on.
        
         | bix6 wrote:
         | Curious if you've tried this yourself yet? I'd love to see side
         | by side of a human solo vs a human with copilot for something
         | like this. AI will surely make mistakes so who will be faster /
         | have better code in the end?
        
         | amelius wrote:
         | FWIW, what I want most in Solvespace is a way to do chamfers
         | and fillets.
         | 
         | And a way to define parameters (not sure if that's already
         | possible).
        
           | phkahler wrote:
           | >> FWIW, what I want most in Solvespace is a way to do
           | chamfers and fillets.
           | 
           | I've outlined a function for that and started to write the
            | code. At a high level it's straightforward, but the details
           | are complex. It'll probably be a year before it's done.
           | 
           | >> And a way to define parameters (not sure if that's already
           | possible).
           | 
           | This is an active work in progress. A demo was made years
           | ago, but it's buggy and incomplete. We've been working out
           | the details on how to make it work. I hope to get the units
           | issue dealt with this week. Then the relation constraints can
           | be re-integrated on top - that's the feature where you can
           | type arbitrary equations on the sketch using named parameters
           | (variables). I'd like that to be done this year if not this
           | summer.
        
             | amelius wrote:
             | Sounds great, thanks for all the good work!
             | 
              | By the way, if it would make things simpler, perhaps you
              | could implement chamfering as a post-processing step.
              | That might make it less general, but it would still be
              | super useful.
        
             | stn8188 wrote:
             | While I second the same request, I'm also incredibly
             | grateful for Solvespace as a tool. It's my favorite MCAD
             | program, and I always reach for it before any others. Thank
             | you for your work on it!
        
         | jchw wrote:
          | I suspect it _probably_ won't work, although that's not
          | necessarily because an LLM architecture could _never_
          | perform this type of work, but rather because LLMs work best
          | when the training set contains an inordinate amount of
          | sample data. I'm actually quite shocked at what they can do
          | in TypeScript and JavaScript, but they're definitely a bit
          | less "sharp" when it comes to stuff outside of that zone, in
          | my experience.
         | 
          | The ridiculous amount of data required to get here hints, in
          | my opinion, that something is wrong.
         | 
         | I'm not sure if we're totally on the same page, but I
         | understand where you're coming from here. Everyone keeps
         | talking about how transformational these models are, but when
          | push comes to shove, the cynicism isn't out of fear or
          | panic, it's disappointment over and over and over. Like, if
          | we had an army of virtual programmers fixing serious
          | problems for open source projects, I'd be more excited about
          | the possibilities than worried about the fact that I just
          | lost my job. Honest to God. But the thing is, if that really
          | _were_ happening, we'd see it. And it wouldn't have to be
          | forced and exaggerated all the time; it would be _plainly_
          | obvious, like the way AI art has absolutely _flooded_ the
          | Internet... except I don't give a damn if code is soulless
          | as long as it's good, so it might even be more welcome. (The
          | only issue is that it would most likely actually _suck_ when
          | that happens, being just functional _enough_ to get away
          | with it, but I like to _try_ to be optimistic once in a
          | while.)
         | 
         | You really make me want to try this, though. Imagine if it
         | worked!
         | 
         | Someone will probably beat me to it if it can be done, though.
        
           | skydhash wrote:
            | > _the cynicism isn't out of fear or panic, it's
            | disappointment over and over and over_
           | 
            | Very much this. When you criticize LLM marketing, people
            | will say you're a Luddite.
           | 
            | I'd bet that no one actually likes to write code, as in
            | typing into an editor. We know how to do it, and it's easy
            | enough to enter a flow state while doing it. But everyone
            | is trying to write less code themselves, hence the
            | proliferation of reusable code, libraries, frameworks,
            | code generators, metaprogramming, ...
           | 
           | I'd be glad if I could have a DAW or CAD like interface with
           | very short feedback (the closest is live programming with
           | Smalltalk). So that I don't have to keep visualizing the
           | whole project (it's mentally taxing).
        
             | SpaceNoodled wrote:
             | I like writing code. It's a fun and creative endeavor to
             | figure out how to write as little as possible.
        
             | galbar wrote:
             | >I'd bet that no one actually likes to write code
             | 
             | And you'd be wrong. I, for one, enjoy the process of
             | handcrafting the individual mechanisms of the systems I
             | create.
        
               | skydhash wrote:
               | Do you like writing all the if, def, public void, import
               | keywords? That is what I'm talking about. I prefer IDE
               | for java and other verbose languages because of the code
               | generation. And I configure my editors for templates and
               | snippets because I don't like to waste time on entering
                | every single character (and learned vim because I can
                | act on bigger units: words, lines, whole blocks).
               | 
               | I like programming, I do not like coding.
        
           | jay_kyburz wrote:
            | So yesterday I wanted to convert a color palette I had in
            | Lua, stored as three RGB ints per entry, to JavaScript
            | 0x000000 notation. I sighed, rolled my eyes, but before I
            | started this incredibly boring, mindless task, I asked
            | Gemini if it would just do it for me. It worked, I was
            | happy, and I moved on.
           | 
            | Something is happening, it's just not as exciting as some
            | people make it sound.
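[A conversion like this is mechanical enough to script deterministically, which is also a cheap way to spot-check an LLM's output. A sketch, assuming the Lua palette stores one {r, g, b} triple per entry:]

```python
import re

# Assumed input shape: a Lua table with one {r, g, b} triple per entry.
lua = """
palette = {
    {255, 0, 0},
    {0, 128, 255},
}
"""

# Pull out each inner {r, g, b} triple; the outer table brace never
# matches because it is not followed by digits.
triples = re.findall(r"\{\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\}", lua)

# Pack each triple into JavaScript 0xRRGGBB notation.
js = [f"0x{int(r):02X}{int(g):02X}{int(b):02X}" for r, g, b in triples]
print(js)  # ['0xFF0000', '0x0080FF']
```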
        
             | jchw wrote:
             | Be a bit more careful with that particular use case. It
             | usually works, but depending on circumstances, LLMs have a
             | relatively high tendency to start making the wrong
             | correlations and give you results that are not actually
             | accurate. (Colorspace conversions make it more obvious, but
             | I think even simpler problems can get screwed up.)
             | 
             | Of course, for that use case, you can _probably_ do a bit
             | of text processing in your text processing tools of choice
             | to do it without LLMs. (Or have LLMs write the text
             | processing pipeline to do it.)
        
           | ModernMech wrote:
           | > if that really were happening, we'd see it.
           | 
           | You're right, instead what we see is the emergence of "vibe
           | coding", which I can best describe as a summoning ritual for
           | technical debt and vulnerabilities.
        
         | stickfigure wrote:
         | My coding challenges are all variations on "start with this
         | 1.5M line Spring project, full of multi-thousand-line files..."
        
           | qwertox wrote:
            | But you are aware that their limited context length means
            | they just won't be able to deal with this?
           | 
           | That's like saying that you're judging a sedan by its
           | capability of performing the job of a truck.
           | 
           | Wait, you were being sarcastic?
        
             | stickfigure wrote:
             | I am indeed saying that a sedan is incapable of handling my
             | gigantic open-pit superfund site.
             | 
             | But I'll go a little farther - most meaningful, long-lived,
             | financially lucrative software applications are
             | metaphorically closer to the open-pit mine than the
             | adorable backyard garden that AI tools can currently
             | handle.
        
         | iamleppert wrote:
         | GTK is an abomination of a UI framework. You should be looking
          | for another way to manage your UI entirely, not trying to
          | keep up with the Joneses, who will no doubt release
          | something new in short order and set up yet another hoop to
          | jump through, without providing any benefit to you at all.
         | 
         | It's openly hostile to not consider the upgrade path of
         | existing users, and make things so difficult that it requires
         | huge lifts just to upgrade versions of something like a UI
         | framework.
        
           | phkahler wrote:
           | >> GTK is an abomination of a UI framework.
           | 
           | I respectfully disagree with that. I think it's a solid UI
           | framework, but...
           | 
           | >> It's openly hostile to not consider the upgrade path of
           | existing users, and make things so difficult that it requires
           | huge lifts just to upgrade versions of something like a UI
           | framework.
           | 
           | I completely agree with you on that. We barely use any UI
           | widgets so you'd think the port would be easy enough. I went
           | through most of the checklist for changes you can make while
           | still using GTK3 in prep for 4. "Don't access event structure
           | members directly, use accessor functions." OK I made that
           | change which made the code a little more verbose. But then
           | they changed a lot of the accessor functions going from 3 to
           | 4. Like WTF? I'm just trying to create a menu but menus don't
           | exist any more - you make them out of something else. Oh and
           | they're not windows they are surfaces. Like why?
           | 
           | I hope with some of the big architectural changes out of the
           | way they can stabilize and become a nice boring piece of
           | infrastructure. The talk of regular API changes every 3-5
           | years has me concerned. There's no reason for that.
        
       | MrScruff wrote:
       | The evidence given really doesn't justify the conclusion. Maybe
       | it suggests 2.5 Pro might be better if you're asking it to build
       | Javascript apps from scratch, but that hardly equates to "It's
       | better at coding". Feels like a lot of LLM articles follow this
       | pattern, someone running their own toy benchmarks and confidently
       | extrapolating broad conclusions from a handful of data points.
       | The SWE-Bench result carries a bit more weight but even that
       | should be taken with a pinch of salt.
        
         | throwaway0123_5 wrote:
         | > The SWE-Bench result carries a bit more weight
         | 
         | Although I have issues with it (few benchmarks are perfect), I
          | tend to agree. Gemini's 63.8 vs. Sonnet's 62.3 isn't a huge
          | jump, though. To Gemini's credit, it solved a bug in my PyTorch
         | code yesterday that o1 (through the web app) couldn't (or at
         | least didn't with my prompts).
        
         | namaria wrote:
          | There are three things this hype cycle excels at: getting
          | money from investors for foundational model creators and
          | startup.ai; spinning layoffs as a good sign for big corps;
          | and helping people looking for clout online look like
          | clever tech bloggers.
        
       | dysoco wrote:
        | Useful article, but I would rather see comparisons where the
        | model takes a codebase and tries to modify it given a series
        | of instructions, rather than attempting to zero-shot
        | implementations of games or solving problems. I feel like
        | that better fits the real use cases of these tools.
        
       | claudiug wrote:
        | That guy Theo-t3 is a bit too strange for my taste :)
        
       | dsign wrote:
        | I guess it depends on the task? I had very low expectations
        | for Gemini, but I gave it a run on an easy signal-processing
        | problem and it did well. It took 30 seconds to reason through
        | a problem that would have taken me 5 to 10 minutes. Gemini's
        | reasoning was sound (though it took me a couple of minutes to
        | decide that), and it also wrote the functions with the changes
        | (which took me an extra minute to verify). It's not a
        | definitive win in _time_, but at least there was an extra pair
        | of "eyes" -- or whatever that's called with a system like this
        | one.
       | 
        | All in all, I think we humans are well on our way to becoming
        | legal flesh[*].
        | 
        | [*] The part of the system to whip or throw in jail when a
        | human+LLM pair commits a mistake.
        
         | vonneumannstan wrote:
          | >I guess it depends on the task? I had very low expectations
          | for Gemini, but I gave it a run on an easy signal-processing
          | problem and it did well. It took 30 seconds to reason
          | through a problem that would have taken me 5 to 10 minutes.
          | Gemini's reasoning was sound (though it took me a couple of
          | minutes to decide that), and it also wrote the functions
          | with the changes (which took me an extra minute to verify).
          | It's not a definitive win in time, but at least there was an
          | extra pair of "eyes" -- or whatever that's called with a
          | system like this one.
         | 
         | I wonder if you treat code from a Jr engineer the same way?
          | Seems impossible to scale a team that way. You shouldn't
          | need to verify every line, but rather have test harnesses
          | that ensure adherence to the spec.
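[A minimal sketch of such a harness; `slugify` is a made-up example function standing in for whatever the LLM (or junior engineer) produced:]

```python
# Spec-as-tests: regardless of who wrote slugify, the harness below
# is what we trust, so every line need not be hand-reviewed.

def slugify(title: str) -> str:
    # Imagine this body came back from an LLM.
    return "-".join(title.lower().split())


def test_slugify():
    # The spec, pinned down as executable assertions.
    assert slugify("Hello World") == "hello-world"
    assert slugify("  spaced   out  ") == "spaced-out"
    assert slugify("already-slugged") == "already-slugged"


test_slugify()
print("spec satisfied")
```

[Under a test runner like pytest, the failing output from this harness is exactly what would be fed back to the model for the next attempt.]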
        
       | antirez wrote:
        | In complicated code I'm developing (Redis Vector Sets), I use
        | both Claude 3.7 and Gemini 2.5 Pro to perform code reviews.
        | Gemini 2.5 Pro can find things that are outside Claude's
        | abilities, even if Gemini, as a general-purpose model, is
        | worse. It's inherently more powerful at reasoning about
        | complicated code: threading, logical errors, ...
        
         | larodi wrote:
          | Is this to say that you're writing the code manually and
          | having the model check for various errors, or are you also
          | employing the model for actual code work?
          | 
          | Do you instruct the model to write in "your" coding style?
        
       | nprateem wrote:
       | Sometimes these models get tripped up with a mistake. They'll add
       | a comment to the code saying "this is now changed to [whatever]"
       | but it hasn't made the replacement. I tell it it hasn't made the
       | fix, it apologises and does it again. Subsequent responses lead
       | to more profuse apologies with assertions it's definitely fixed
       | it this time when it hasn't.
       | 
       | I've seen this occasionally with older Claude models, but Gemini
       | did this to me very recently. Pretty annoying.
        
       | larodi wrote:
        | Funny how the "give me a Dinosaur game" single prompt
        | translates into FF's dinosaur 404-not-found game.
        
       | 0x1ceb00da wrote:
        | I tried the exact prompt and model from the blog post, but my
        | outputs were way off -- anyone else seeing this? This is the
        | best-of-3 output for the flight simulator prompt (Gemini 2.5
        | Pro (experimental)):
       | 
       | https://imgur.com/0uwRbMp
        
       | superkuh wrote:
       | What is most apparent to me (putting in existing code and asking
       | for changes) is Gemini 2.5 Pro's tendency to refuse to actually
       | type out subroutines and routinely replace them with either stubs
        | or comments that say "put the subroutines back here". Even
        | when Gemini's results are good, they're still broken and
        | require lots of manual work/thinking to get the subroutines
        | back into the code and hooked up properly.
       | 
        | With a 1-million-token context you'd think they'd let the LLM
        | actually use it, but all the tricks to save on token count
        | just make it... not useful.
        
       | evantbyrne wrote:
       | The common issue I run into with all LLMs is that they don't seem
       | to be able to complete the same coding tasks where googling
       | around also fails to provide working solutions. In particular,
       | they seem to struggle with libraries/APIs that are less
       | mainstream.
        
       | siliconc0w wrote:
        | This is interesting but too greenfield; someone should do one
        | with an existing OSS project and try to add a feature or fix a
        | bug.
        
       | stared wrote:
       | At this level, it is very contextual - depending on your tools,
       | prompts, language, libraries, and the whole code base. For
       | example, for one project, I am generating ggplot2 code in R;
       | Claude 3.5 gives way better results than the newer Claude 3.7.
       | 
       | Compare and contrast https://aider.chat/docs/leaderboards/,
       | https://web.lmarena.ai/leaderboard, https://livebench.ai/#/.
        
       | lherron wrote:
       | These one-shot prompts aren't at all how most engineers use these
       | models for coding. In my experience so far, Gemini 2.5 Pro is
       | great at generating code but not so great at instruction
       | following or tool usage, which are key for any iterative coding
       | tasks. Claude is still king for that reason.
        
         | jgalt212 wrote:
         | Agreed. I've never successfully one-shotted anything non-
         | trivial or non-pedagogical.
        
       | HarHarVeryFunny wrote:
       | I'd like to see an honest attempt by someone to use one of these
       | SOTA models to code an entire non-trivial app. Not a "vibe
        | coding" flappy bird clone or minimal iOS app (call API to count
       | calories in photo), but something real - say 10K LOC type of
       | complexity, using best practices to give the AI all the context
       | and guidance necessary. I'm not expecting the AI to replace the
       | programmer - just to be a useful productivity tool when we move
       | past demos and function writing to tackling real world projects.
       | 
       | It seems to me that where we are today, AI is only useful for
       | coding for very localized tasks, and even there mostly where it's
       | something commonplace and where the user knows enough to guide
       | the AI when it's failing. I'm not at all convinced it's going to
       | get much better until we have models that can actually learn (vs
       | pre-trained) and are motivated to do so.
        
         | gedy wrote:
         | I know they are capable of more, but I also tire of people
         | being so enamored with "bootstrap a brand new app" type AI
          | coding - like, is that even a big part of our job? In 25
          | years of dev work, I've needed to do that for a commercial
          | production app like... twice? 3 times? Help me deal with
          | existing apps and codebases, please.
        
         | kaiokendev wrote:
         | I made this NES emulator with Claude last week [0]. I'd say it
         | was a pretty non-trivial task. It involved throwing a lot of
         | NESDev docs, Disch mapper docs, and test rom output + assembly
         | source code to the model to figure out.
         | 
         | [0]: https://kaiokendev.github.io/nes/
        
           | HarHarVeryFunny wrote:
           | How would you characterize the overall structural complexity
           | of the project, and degree of novelty compared to other NES
           | emulators Claude may have seen during training ?
           | 
           | I'd be a bit suspect of an LLM getting an emulator right,
           | when all it has to go on is docs and no ability to test
           | (since pass criteria is "behaves same as something you don't
           | have access to")... Did you check to see the degree to which
           | it may have been copying other NES emulators ?
        
             | kaiokendev wrote:
             | > How would you characterize the overall structural
             | complexity of the project, and degree of novelty compared
             | to other NES emulators Claude may have seen during training
             | ?
             | 
             | Highly complex, fairly novel.
             | 
             | Emulators themselves, for any chipset or system, have a
             | very learnable structure: there are some modules, each
             | having their own registers and ways of moving data between
             | those registers, and perhaps ways to send interrupts
             | between those modules. That's oversimplifying a bit, but if
             | you've built an emulator once, you generally won't be
             | blindsided when it comes to building another one. The bulk
             | of the work lies in dissecting the hardware, which has
             | already been done for the NES, and more open architectures
             | typically have their entire pinouts and processes available
             | online. All that to say - I don't think Claude would have
             | difficulty implementing most emulators - it's good enough
             | at programming and parsing assembly that as long as the
             | underlying microprocessor architecture is known, it can
             | implement it.
             | 
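[The module-and-registers shape described above can be sketched abstractly; the names below are illustrative, not taken from the actual emulator:]

```python
class Module:
    """One emulated chip: a bag of registers plus an interrupt flag."""

    def __init__(self, name):
        self.name = name
        self.registers = {}
        self.pending_irq = False

    def write(self, reg, value):
        self.registers[reg] = value & 0xFF  # model 8-bit registers

    def read(self, reg):
        return self.registers.get(reg, 0)

    def raise_irq(self):
        self.pending_irq = True


class Machine:
    """Moves data between modules and routes interrupts to the CPU."""

    def __init__(self):
        self.cpu = Module("cpu")
        self.ppu = Module("ppu")
        self.apu = Module("apu")
        self.modules = [self.cpu, self.ppu, self.apu]

    def step(self):
        # e.g. the PPU finishes a frame and interrupts the CPU
        # (NMI-like); the machine delivers it as a register write.
        self.ppu.raise_irq()
        if self.ppu.pending_irq:
            self.cpu.write("irq_vector", 0x40)
            self.ppu.pending_irq = False


m = Machine()
m.step()
print(hex(m.cpu.read("irq_vector")))  # prints 0x40
```

[The real work, as the comment says, is in the hardware-specific details each module hides; the skeleton itself transfers between systems.]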
              | As far as other NES emulators go, this project does many
             | things in non-standard ways, for instance I use per-pixel
             | rendering whereas many emulators use scanline rendering. I
             | use an AudioWorklet with various mixing effects for audio,
             | whereas other emulators use something much simpler or don't
             | even bother fully implementing the APU. I can comfortably
             | say there's no NES emulator out there written the way this
             | one is written.
             | 
             | > I'd be a bit suspect of an LLM getting an emulator right,
             | when all it has to go on is docs and no ability to test
             | (since pass criteria is "behaves same as something you
             | don't have access to")... Did you check to see the degree
             | to which it may have been copying other NES emulators ?
             | 
              | Purely JavaScript-based NES emulators are few in number,
             | and those that implement all aspects of the system even
             | fewer, so I can comfortably say it doesn't copy any of the
             | ones I've seen. I would be surprised if it did, since I
             | came up with most of the abstractions myself and guided
              | Claude heavily. While Claude can't get docs on its own, I
             | can. I put all the relevant documentation in the context
             | window myself, along with the test rom output and source
             | code. I'm still commanding the LLM myself, it's not like I
             | told Claude to build an emulator and left it alone for 3
             | days.
        
               | HarHarVeryFunny wrote:
               | Interesting - thanks!
               | 
                | Even with your own expert guidance, it does seem
                | impressive that Claude was able to complete a project
                | like this without getting bogged down in the
                | complexity.
        
           | nowittyusername wrote:
            | I am considering training a custom LoRA on Atari ROMs to
            | see if I could get a working game out of it. The thinking
            | here is that Atari, NES, SNES, etc. ROMs are a lot smaller
            | than a program that runs natively on whatever OS. Fewer
            | lines of code for the LLM to write means less chance of a
            | screw-up. Take the ROM, convert it to assembly, write very
            | detailed captions for the ROM, and train... If this works,
            | it would enable anyone to create games with one prompt
            | that are a lot higher quality than the stuff being made
            | now, and with less complexity. If you made an emulator
            | with the help of an LLM, that means it understands
            | assembly well enough, so I think there might be hope for
            | this idea.
        
         | lordswork wrote:
         | I'm at 3k LOC on a current Rust project I'm mostly vibe coding
         | with my very limited free time. Will share when I hit 10k :)
        
           | HarHarVeryFunny wrote:
           | Would you mind sharing what the project is, and which AI you
           | are using? Have you seen any sign of the AI's usefulness
           | dropping off as the complexity increases?
        
             | genewitch wrote:
             | Have you noticed that no one links their AI code?
        
               | SweetSoftPillow wrote:
               | Aider is written with AI, you're welcome.
        
         | Pannoniae wrote:
         | I've been using Claude 3.7 for various things, including
         | helping with game development tasks. The generated code usually
         | requires editing, and it can't autonomously handle more than a
         | few functions at once, but it's a fairly useful productivity
         | tool. The logic side is also quite good: it can sketch out
         | various ideas/algorithms and suggest some optimisations.
         | 
         | Tech stack is nothing fancy/rare but not the usual ReactJS slop
         | either - it's C# with OpenGL.
         | 
         | I can't comment about the best practices though because my
         | codebase follows none of them.
         | 
         | Yes, the user has to know enough to guide the AI when it's
         | failing. So it can't exactly replace the programmer as it is
         | now.
         | 
         | It really can't do niche stuff however - like SIMD. Maybe it
         | would be better if I compiled a cheatsheet of .NET SIMD
         | snippets and howtos because this stuff isn't really on the
         | internet in a coherent form at all. So it's highly unlikely
         | that it was trained on that.
        
           | HarHarVeryFunny wrote:
           | Interesting - thanks! This isn't the type of tech stack where
           | I'd have expected it to do very well, so the fact that you're
           | finding it productive is encouraging. The function-level
           | competency (and only that) is similar to what I've
           | experienced - not enough to encourage me to try anything more
           | complex.
        
         | redox99 wrote:
         | I use cursor agent mode with claude on my NextJS frontend and
         | Typescript GraphQL backend. It's a real, reasonably sized,
         | production app that's a few years old (pre-ChatGPT).
         | 
         | I vibe code the vast majority features nowadays. I generally
         | don't need to write a single line of code. It often makes some
         | mistakes but the agent figures out that the tests fail, or it
         | doesn't build, fixes it, and basically "one-shots" it after
         | doing its thing.
         | 
         | Only occasionally do I need to write a few lines of code or give
         | it a hint when it gets stuck. But 99% of the code is written by
         | cursor.
        
           | HarHarVeryFunny wrote:
           | That's pretty impressive - a genuine real-world use case
           | where the AI is doing the vast majority of the work.
        
           | orange_puff wrote:
           | When you say "vibe code" do you mean the true definition of
           | that term, which is to blindly accept any code generated by
           | the AI, see if it works (maybe agent mode does this) and move
           | on to the next feature? Or do you mean prompt driven
           | development, where although you are basically writing none of
           | the code, you are still reading every line and maintain high
           | involvement in the code base?
        
             | redox99 wrote:
             | Kind of in between. I accept a lot of code without ever
             | seeing it, but I check the critical stuff that could cause
             | trouble. Or stuff that I know the AI is likely to mess up.
             | 
             | Specifically for the front end I mostly vibe code, and for
             | the backend I review a lot of the code.
             | 
             | I will often follow up with prompts asking it to extract
             | something to a function, or to not hardcode something.
        
         | axkdev wrote:
         | I dunno what you would consider non-trivial. I am building a
         | diffing plugin for Neovim. The experience is... mixed. The fast
         | progression at the start was impressive, but now that the code
         | base has grown, the issues show up. The code is a mess. Adding
         | one feature breaks another, and so on. I have no problem using
         | the agent on code that I know very well, because I can steer it
         | in the exact direction I want. But vibe coding something I
         | don't fully understand is a pain.
        
       | theonething wrote:
       | Has anybody used Claude, Gemini, ChatGPT, etc. for fixing CSS
       | issues? I've tried with Claude 3.7, with lackluster results. I
       | provided a screenshot and asked it to fix an unwanted artifact.
       | 
       | Wondering about other people's experiences.
        
       | igorguerrero wrote:
       | > consistently 1-shots entire tickets
       | 
       | Uhh, no? First off, that's a huge exaggeration even for human
       | coders; second, I think for this to be true your project is
       | probably a blog.
        
       | eugenekolo wrote:
       | It's definitely an attempt to compare models, and Gemini clearly
       | won in the tests. But, I don't think the tests are particularly
       | good or illustrative. It's generally easy to ask an AI to
       | generate greenfield JS code for common tasks, and Leetcode's
       | been done 1000 times on GitHub and Stack Overflow, so the
       | solutions are all right there.
       | 
       | I'd like to see tests that are more complicated for AI - things
       | like refactoring an existing codebase, writing a program to
       | auto-play God of War for you, improving the response time of a
       | keyboard driver, and so on.
        
       | skerit wrote:
       | I've been using Gemini 2.5 Pro with Roo-Code a lot these past few
       | days. It has really helped me a lot. I managed to get it to
       | implement entire features (with some manual cleanup at the end).
       | 
       | The fact that it's free for now (I know they use it for
       | training, and that's OK) is a big plus, because I've had to
       | restart a task from scratch quite a few times. If I calculate
       | what this would have cost me using Claude, it would have been
       | 200-300 euros.
       | 
       | I've noticed that as soon as it makes a mistake (messing up the
       | diff format is a classic), the current task is basically a total
       | loss. For some reason, most coding tools just inform the model
       | that it made a mistake and should try again... but at that
       | point, its broken response is part of the history, and it's
       | basically multi-shotting itself into making more mistakes. They
       | should really just filter these out.
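The filtering idea described above can be sketched as a retry loop that never lets a failed attempt back into the context. This is a minimal illustration of the concept, not how any particular tool implements it; all names are hypothetical:

```python
def retry_without_failures(history, make_request, validate, max_attempts=3):
    """Retry a model call, keeping failed attempts out of the context.

    Rather than appending a broken response plus an error message to the
    history (which primes the model to repeat the mistake), each retry is
    issued against the original, clean history.
    """
    for _ in range(max_attempts):
        response = make_request(history)
        if validate(response):
            # Only a valid response ever becomes part of the context.
            history.append({"role": "assistant", "content": response})
            return response
    raise RuntimeError(f"no valid response after {max_attempts} attempts")
```

Here `make_request` would wrap the LLM API call and `validate` would check, for example, that the diff format parses.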
        
       | jstummbillig wrote:
       | This has not been my experience using it with Windsurf, which
       | touches on an interesting point: When a tool has been optimized
       | around one model, how much is it inhibiting another (newly
       | released) model and how much adjustment is required to take
       | advantage of the new model? Increasingly, as tools get better, we
       | will not directly interact with the models. I wonder how the tool
       | makers handle this.
        
       | thedangler wrote:
       | I still can't get any LLM to use my niche API and build out API
       | REST requests for all the endpoints. It just makes stuff up
       | even though it knows the API documentation. As soon as one can
       | do that, I'll be sold. Until then I feel like it's all coding
       | problems it's seen on GitHub or in source code somewhere.
        
       | overgard wrote:
       | I remember back in the day when I did Visual Basic in the 90s
       | there were a lot of cool "New Project from Template" things in
       | Visual Studio, especially when you installed new frameworks and
       | SDKs and stuff like that. With a click of a button you had
       | something that kind of looked like a professional app! Or even
       | now, the various create-whatever-app tooling in npm and node
       | keeps on that legacy.
       | 
       | Anyway, AI "coding" makes me think of that, but on steroids.
       | It's _fine_, but the hype around it is silly - it's like
       | declaring you can replace Microsoft Word because "New Project
       | From Template" got you a little rich text widget in a window
       | with a toolbar.
       | 
       | One of the things mentioned in the article is the writer was
       | confused that Claude's airplane was sideways. But it makes
       | perfect sense, Claude doesn't really care about or understand
       | airplanes, and as soon as you try to refine these New Project
       | From Template things the AI quickly stops being useful.
        
       | charcircuit wrote:
       | >Minecraft-styled block buildings
       | 
       | The buildings weren't Minecraft-style in either case. They
       | weren't formed on a voxel grid, and the textures weren't 16x16
       | but rather a rectangle, or at least stretched into one. Also,
       | buildings typically are not just built as a single cuboid.
        
       | sxp wrote:
       | One prompt I use for testing is: "Using three.js, render a
       | spinning donut with gl.TRIANGLE_STRIP". The catch here is that
       | three.js doesn't support TRIANGLE_STRIP for architectural
       | reasons[1]. Before I knew this, I got confused as to why all the
       | AIs kept failing and gaslighting me about using TRIANGLE_STRIP.
       | If the AI fails to tell the user that this is an impossible task,
       | then it has failed the test. So far, I haven't found an AI that
       | can determine that the request isn't valid.
       | 
       | [1] https://discourse.threejs.org/t/is-there-really-no-way-to-
       | us...
        
       | nisten wrote:
       | They nerfed it as of Sunday, March 30 - a lot of people noticed
       | a performance drop and rambling.
       | 
       | https://x.com/nisten/status/1906141823631769983
       | 
       | It would be nice if this review actually stated exactly when
       | they conducted their tests.
        
         | mtaras wrote:
         | ("it" being Claude 3.7, not Gemini)
        
       | breadwinner wrote:
       | The loser in the AI model competition appears to be... Microsoft.
       | 
       | When ChatGPT was the only game in town, Microsoft was seen as a
       | leader, thanks to their wise investment in OpenAI. They relied
       | on OpenAI's models and didn't develop their own. As a result,
       | Microsoft has no interesting AI products. Copilot is a flop, and
       | Bing failed to take advantage of AI; Perplexity ate their lunch.
       | 
       | Satya Nadella last year: "Google should have been the default
       | winner in the world of big tech's AI race".
       | 
       | Sundar Pichai's response: "I would love to do a side-by-side
       | comparison of Microsoft's own models and our models any day, any
       | time. They are using someone else's model."
       | 
       | See: https://www.msn.com/en-in/money/news/sundar-pichai-vs-
       | satya-...
        
         | maxloh wrote:
         | Note that Microsoft does have its own LLM team, and its own
         | model called Phi-4.
         | 
         | https://huggingface.co/microsoft/phi-4
        
           | VladVladikoff wrote:
           | Recently I was looking for a small LLM that could perform
           | reasonably well while answering questions with low latency,
           | for near realtime conversations running on a single RTX 3090.
           | I settled on Microsoft's Phi-4 model so far. However, I'm
           | not sure yet if my choice is good, and I'm open to more
           | suggestions!
        
             | mywittyname wrote:
             | I've been using claude running via Ollama
             | (incept5/llama3.1-claude) and I've been happy with the
             | results. The only annoyance I have is that it won't search
             | the internet for information because that capability is
             | disabled via flag.
        
               | danielbln wrote:
               | That's.. that's not the Claude people talk about when
               | they say Claude. Just to be sure.
        
         | gnatolf wrote:
         | Any way you can back up that Copilot is a flop?
        
           | breadwinner wrote:
           | Lots of articles on it... and I am not even talking about
           | competitors like Benioff [1]. I am talking about user
           | complaints like this [2]. Users expect Copilot to be fully
           | integrated, like Cursor is into VSCode. Instead what you get
           | is barely better than typing into standalone AI chats like
           | Claude.AI.
           | 
           | [1] https://www.cio.com/article/3586887/marc-benioff-rails-
           | again...
           | 
           | [2] https://techcommunity.microsoft.com/discussions/microsoft
           | 365...
        
             | paavohtl wrote:
             | The linked complaint is specifically about Microsoft
             | Copilot, which despite the name is completely unrelated to
             | the original GitHub Copilot. VS Code's integrated GitHub
             | Copilot nowadays has the Copilot Edits feature, which can
             | actually edit, refactor and generate files for you using a
             | variety of models, pretty much exactly like Cursor.
        
               | breadwinner wrote:
               | Sorry I meant Microsoft Copilot should be as integrated
               | into Office as Cursor is into VSCode. I was not talking
               | about GitHub Copilot.
        
         | jcmp wrote:
         | When my parents speak about AI, they call it Copilot.
         | Microsoft has a big advantage in that it can integrate AI into
         | many daily-used products where, unlike Google, it is not
         | competing with their core product.
        
           | ErrorNoBrain wrote:
           | And Google has it built into my phone's text message app.
           | 
           | These days it seems like everyone is trying to get their AI
           | to be the standard.
           | 
           | I wonder how things will look in 10 years.
        
         | ZeWaka wrote:
         | I don't think the Copilot product is a flop - they're doing
         | quite well selling it along with GitHub and Visual Studio
         | (Code).
         | 
         | The best part about it, coding-wise, is that you can choose
         | between 7 different models.
        
           | airstrike wrote:
           | I think he's talking about Microsoft Copilot 365, not the
           | coding assistant.
           | 
           | Makes one wonder how much they are offering to the owner of
           | www.copilot.com and why on God's green earth they would
           | abandon the very strong brand name "Office" and
           | www.office.com
        
             | l5870uoo9y wrote:
             | Had to look up office.com myself to see it; their office
             | package is literally called MS Copilot.
        
               | airstrike wrote:
               | It gets worse, actually. My comment was inaccurate
               | because it could also be the windows assistant outside of
               | MS Office.
               | 
               | At this point, Occam's Razor dictates companies must make
               | these terribly confusing branding choices on purpose. It
               | has to be by design.
        
           | breadwinner wrote:
           | I consider Copilot a flop because it can't do anything. For
           | example, open Copilot on Windows and ask it to increase the
           | volume. It can't do it, but it will give you instructions for
           | how to do it. In other words it is no better than standalone
           | AI chat websites.
        
         | dughnut wrote:
         | Copilot is the only authorized AI at my company (50K FTE). I
         | would be cautious to make any assumptions about how well anyone
         | is doing in the AI space without some real numbers. My cynical
         | opinion on enterprise software sales is that procurement
         | decisions have absolutely nothing to do with product cost,
         | performance, or value.
        
       | stared wrote:
       | Just a moment ago I tried to get Gemini 2.5 (in Cursor) to use
       | the Python Gemini SDK. It failed, even after a few iterations.
       | 
       | Then I ran Claude 3.7 - it worked fine.
       | 
       | So yeah, it depends on the case. But I am surprised that model
       | creators don't put extra effort into handling their own tools.
        
       | Extropy_ wrote:
       | Why is Grok not in their benchmarks? I don't see comparisons to
       | Grok in any recent announcements about models. In fact, I see
       | practically no discussion of Grok on HN or anywhere except
       | Twitter in general.
        
         | nathanasmith wrote:
         | Is there an API for Grok yet? If not that could be the issue.
        
       | raffkede wrote:
       | I had huge success letting Gemini 2.5 one-shot whole codebases
       | in a single text file format and then splitting them up with a
       | script. It puts in work for like 5 minutes and spits out a
       | working codebase. I also asked it to show off a little bit, and
       | it almost one-shotted a Java cloud service to generate PDF
       | invoices from API calls (it made some minor mistakes, but after
       | feeding them back it fixed them).
       | 
       | I basically use two scripts: one to flatten the whole codebase
       | into one text file, and one to split it back up. Give it a shot,
       | it's amazing...
        
         | archeantus wrote:
         | Can you please expound on this? You're using this approach to
         | turn an existing codebase into a single file and then asking
         | Gemini to make changes/enhancements? Does it also handle
         | breaking the files back out? Would love more info!
        
           | raffkede wrote:
           | I created a script that merges all files in a directory into
           | this format, and a counterpart that splits it again. Below
           | is just a small sample I asked it to create to show the
           | format, but I did it with almost 80 files, including lots of
           | documentation, etc.
           | 
           | When provided with the flat format, it was able to replicate
           | it without much instruction. For a blank prompt, I had
           | success with the prompt below.
           | 
           | ===FILE===
           | Index: 1
           | Path: src/main/java/com/example/myapp/Greeter.java
           | Length: 151
           | Content:
           | package com.example.myapp;
           | 
           | public class Greeter {
           |     public String getGreeting() {
           |         return "Hello from the Greeter class!";
           |     }
           | }
           | ===ENDFILE===
           | ===FILE===
           | Index: 2
           | Path: src/main/java/com/example/myapp/Main.java
           | Length: 222
           | Content:
           | package com.example.myapp;
           | 
           | public class Main {
           |     public static void main(String[] args) {
           |         Greeter greeter = new Greeter();
           |         String message = greeter.getGreeting();
           |         System.out.println("Main app says: " + message);
           |     }
           | }
           | ===ENDFILE===
           | ===FILE===
           | Index: 3
           | Path: pom.xml
           | Length: 659
           | Content:
           | <?xml version="1.0" encoding="UTF-8"?>
           | <project xmlns="http://maven.apache.org/POM/4.0.0"
           |       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           |       xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
           |       http://maven.apache.org/xsd/maven-4.0.0.xsd">
           |     <modelVersion>4.0.0</modelVersion>
           |     <groupId>com.example</groupId>
           |     <artifactId>my-simple-app</artifactId>
           |     <version>1.0-SNAPSHOT</version>
           |     <properties>
           |         <maven.compiler.source>17</maven.compiler.source>
           |         <maven.compiler.target>17</maven.compiler.target>
           |         <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
           |     </properties>
           | </project>
           | ===ENDFILE===
           | 
           | Prompt to request the format if starting from scratch:
           | Present the entire codebase using the following multi-file
           | format:
           | 
           | The codebase should be presented as a single, monolithic text
           | output. Inside this output, represent each file of the
           | project individually using the following structure:
           | 
           | Start Marker: Each file must begin with the exact line:
           | ===FILE===
           | 
           | Metadata Block: Immediately following the start marker,
           | include these four specific metadata lines, each on its own
           | line:
           | 
           | Index: <N> (where <N> is a sequential integer index for the
           | file, starting from 1).
           | 
           | Path: <path/to/file/filename.ext> (The full relative path of
           | the file from the project's root directory, e.g., index.html,
           | css/style.css, js/script.js, jobs.html, etc.).
           | 
           | Length: <L> (where <L> is the exact character count of the
           | file's content that follows).
           | 
           | Content: (This literal line acts as a separator).
           | 
           | File Content: Immediately after the Content: line, include
           | the entire raw content of the file. Preserve all original
           | line breaks, indentation, and formatting exactly as it should
           | appear in the actual file.
           | 
           | End Marker: Each file's section must end with the exact line:
           | ===ENDFILE===
           | 
           | Ensure all necessary files for the project (HTML, CSS, JS)
           | are included sequentially within the single output block
           | according to this structure.
           | 
           | Crucially, enclose the entire multi-file output, starting
           | from the very first ===FILE=== line down to the very last
           | ===ENDFILE=== line, within a single Markdown fenced code
           | block using exactly five backticks (`````) on the lines
           | immediately before the first ===FILE=== and immediately after
           | the last `===ENDFILE===`. This ensures that any triple
           | backticks (```) within the generated file content are
           | displayed correctly.
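The flatten/split scripts themselves aren't shown in the comment; here is a minimal Python sketch of what such a pair could look like for the format described above. The function names are mine, and it assumes UTF-8 text files whose contents don't themselves contain the `===FILE===`/`===ENDFILE===` markers:

```python
import os


def flatten(root, out_path):
    """Merge every file under `root` into one ===FILE===-delimited text file."""
    with open(out_path, "w", encoding="utf-8") as out:
        index = 1
        for dirpath, _, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, root)
                with open(path, encoding="utf-8") as f:
                    content = f.read()
                out.write("===FILE===\n")
                out.write(f"Index: {index}\n")
                out.write(f"Path: {rel}\n")
                out.write(f"Length: {len(content)}\n")
                out.write("Content:\n")
                out.write(content)
                out.write("\n===ENDFILE===\n")
                index += 1


def split(flat_path, dest):
    """Recreate the directory tree from a flattened file."""
    with open(flat_path, encoding="utf-8") as f:
        text = f.read()
    for block in text.split("===FILE===")[1:]:
        body = block.split("===ENDFILE===")[0]
        header, _, content = body.partition("Content:\n")
        meta = dict(
            line.split(": ", 1)
            for line in header.strip().splitlines()
            if ": " in line
        )
        target = os.path.join(dest, meta["Path"])
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        with open(target, "w", encoding="utf-8") as out:
            # Drop the newline that flatten() added before ===ENDFILE===.
            out.write(content[:-1] if content.endswith("\n") else content)
```

Round-tripping `flatten` then `split` reproduces the original tree; the `Length` field is written but, as in the prompt, not strictly needed for parsing.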
        
           | ZeroTalent wrote:
           | There is a better way that I'm using:
           | 
           | 1. Cursor Pro with Sonnet to implement things the Cursor way.
           | 
           | 2. Install the Gemini Code extension in Cursor.
           | 
           | 3. Install the Gemini Coder Connector Chrome extension:
           | https://chromewebstore.google.com/detail/gemini-coder-
           | connec...
           | 
           | 4. Get the free aistudio.google.com Gemini API and connect
           | the extensions.
           | 
           | 5. Feed your codebase or select files via the Cursor
           | extension and get the implementation from
           | aistudio.google.com.
           | 
           | I prefer having Sonnet implement it via Cursor rather than
           | Gemini because it can automatically go through all the
           | linting/testing loops without my extra input, run the server,
           | and check if there are no errors.
        
         | mvdtnz wrote:
         | Anything that can fit in a single LLM output is not a
         | "codebase"; it's just a start. Far too many people with no
         | experience of real software projects think their little
         | 1800-line apps are representative of real software
         | development.
        
       | phforms wrote:
       | I like using LLMs more as coding assistants than having them
       | write the actual code. When I am thinking through problems of
       | organization, API design, naming things, performance
       | optimization, etc., I found that Claude 3.7 often gives me great
       | suggestions, points me in the right direction and helps me to
       | weigh up pros and cons of different approaches.
       | 
       | Sometimes I have it write functions that are very boilerplate to
       | save time, but I mostly like to use it as a tool to think through
       | problems, among other tools like writing in a notebook or drawing
       | diagrams. I enjoy programming too much to want an AI to do it
       | all for me (it also helps that I don't do it as a job, though).
        
       | anotherpaulg wrote:
       | Gemini 2.5 Pro set a new SOTA on the aider polyglot coding
       | leaderboard [0] by a wide margin. It scored 73%, well ahead of
       | the previous 65% SOTA from Sonnet 3.7.
       | 
       | I use LLMs to improve aider, which is >30k lines of python. So
       | not a toy codebase, not greenfield.
       | 
       | I used Gemini 2.5 Pro for the majority of the work on the latest
       | aider release [1]. This is the first release in a very long time
       | which wasn't predominantly written using Sonnet.
       | 
       | The biggest challenge with Gemini right now is the very tight
       | rate limits. Most of my Sonnet usage lately is just when I am
       | waiting for Gemini's rate limits to cool down.
       | 
       | [0] https://aider.chat/docs/leaderboards/
       | 
       | [1] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-
       | bui...
        
         | atonse wrote:
         | As someone who just adopted Cursor (and MCP) 2-3 weeks ago,
         | Aider seems like a different world.
         | 
         | The examples of "create a new simple video game" cause me to
         | glaze over.
         | 
         | Do you have a screencast of how you use aider to develop aider?
         | I'd love to see how a savvy expert uses these tools for real-
         | world solutions.
        
           | anotherpaulg wrote:
           | I actually get asked for screencasts a lot, so I recently
           | made some [0].
           | 
           | The recording of adding support for 100+ new coding languages
           | with tree-sitter [1] shows some pretty advanced usage. It
           | includes using aider to script downloading a collection of
           | files, and using ad-hoc bash scripts to have aider modify a
           | collection of files.
           | 
           | [0] https://aider.chat/docs/recordings/
           | 
           | [1] https://aider.chat/docs/recordings/tree-sitter-language-
           | pack...
        
       | InTheArena wrote:
       | The amazing bit about Claude Code is its ability to read code
       | and fit into the existing code base. I tried Visual Studio Code
       | with Roo, and it blew through my 50-request daily limit
       | immediately. Any suggestions for better tooling that gives a
       | Claude Code-like experience with Gemini 2.5 Pro?
        
       | sfjailbird wrote:
       | Every test task, including the coding test, is a greenfield
       | project. Everything I would consider using LLMs for is not. Like,
       | I would always need it to do some change or fix on a (large)
       | existing project. Hell, even the examples that were generated
       | would likely need subsequent alterations (ten times more effort
       | goes into maintaining a line of code than writing it).
       | 
       | So these tests are meaningless to me, as a measure of how useful
       | these models are. Great for comparison with each other, but would
       | be interesting to include some tests with more realistic work.
        
       | mvdtnz wrote:
       | I must be missing something about Gemini. When I use the web UI
       | it won't even let me upload source code files directly. If I
       | manually copy some code into a directory and upload that I do get
       | it to work, but the coding output is hilariously bad. It produces
       | ludicrously verbose code that so far for me has been 200% wrong
       | every time.
       | 
       | This is on a Gemini 2.5 Pro free trial. Also - god damn is it
       | slow.
       | 
       | For context this is on a 15k LOC project built about 75% using
       | Claude.
        
       | cadamsdotcom wrote:
       | Very nice comparison but constrained to greenfield.
       | 
       | Would love to see a similar article that uses LLMs to add a
       | feature to Gimp, or Blender.
        
       | ionwake wrote:
       | Sorry for the noob question, but Claude has Claude Code - does
       | Gemini Pro work with any software in the same way Claude Code
       | works? If so, what software would I use with it? Thank you.
        
         | simonw wrote:
         | Aider is worth a look.
         | 
         | The current rate limits for Gemini 2.5 Pro make it hard to run
         | something like Claude Code with it, since that tool is very API
         | chatty.
        
           | degrews wrote:
           | Hi Simon. Do you recommend aider over Cursor? I've always
           | used aider, and like it, but it just seems like Cursor is
           | overtaking it in terms of features, and I wonder if sticking
           | with aider still makes sense.
        
             | simonw wrote:
             | I don't actually use Aider or Cursor myself - I still
             | mostly work in the ChatGPT and Claude web interfaces (or
             | apps) directly and do a lot of copy and pasting.
        
         | degrews wrote:
         | Most people use Cursor. Aider and Cline are other options. All
         | of these work with all of the popular LLM APIs. Even among
         | people using Claude, I would bet more of them are using Claude
         | through Cursor than through Claude code.
        
       | asdf6969 wrote:
       | Does anyone know guides to integrate this with any kind of big co
       | production application? The examples are all small toy projects.
       | My biggest problems are more like: there are 4 packages I need
       | to change, and 3 teams and half a dozen microservices are
       | involved.
       | 
       | Does any LLM do this yet? I want to throw it at a project that's
       | in package and microservice hell and get a useful response. Some
       | weeks I spend almost all my time cutting tickets to other teams,
       | writing documents, and playing politics when the other teams
       | don't want me to touch their stuff. I know my organization is
       | broken but this is the world I live in.
        
       | benbojangles wrote:
       | Don't know what the fuss is about over a dino jump game; Claude
       | made me a Flappy Bird ESP32 game last month in one go:
       | https://www.instagram.com/reel/DGcgYlrI_NK/?utm_source=ig_we...
        
       ___________________________________________________________________
       (page generated 2025-03-31 23:01 UTC)