[HN Gopher] Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison
___________________________________________________________________
Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison
Author : mraniki
Score : 399 points
Date : 2025-03-31 12:09 UTC (10 hours ago)
(HTM) web link (composio.dev)
(TXT) w3m dump (composio.dev)
| mraniki wrote:
| TL;DR
|
| If you want to jump straight to the conclusion, I'd say go for
| Gemini 2.5 Pro, it's better at coding, has one million in context
| window as compared to Claude's 200k, and you can get it for free
| (a big plus). However, Claude's 3.7 Sonnet is not that far
| behind. Though at this point there's no point using it over
| Gemini 2.5 Pro.
| kingkongjaffa wrote:
| How are you getting gemini 2.5 pro for free?
|
| In the gemini iOS app the only available models are currently
| 2.0 flash and 2.0 flash thinking.
| lyjackal wrote:
| https://aistudio.google.com
| diggan wrote:
| > How are you getting gemini 2.5 pro for free?
|
| I think the "AI Premium" plan of Google One includes access
| to all the models, including the latest ones (at least that's
| what it says for me in Spain): https://one.google.com/plans
| HarHarVeryFunny wrote:
| They just added it to the free tier today.
| simonjulianl wrote:
| Yup, you can go navigate to https://gemini.google.com >
| choose 2.5 Pro (experimental).
| dsincl12 wrote:
| Not sure what happened with Claude 3.7, but 3.5 is way better
| in all things day to day. 3.7 felt like a major step back
| especially when it comes to coding even though this was
| highlighted as one aspect they improved upon. 500k window will
| soon be released for Claude. Not sure much it will improve
| anything though.
| quesomaster9000 wrote:
| With Claude 3.7 I keep having to remind it about things, and
| go back and correct it several times in a row, before
| cleaning the code up significantly.
|
| For example, yesterday I wanted to make a 'simple' time
| format, tracking Earths orbits of the Sun, the Moons orbits
| of Earth and rotations of Earth from a specific given point
| in time (the most recent 2020 great conjunction) - without
| directly using any hard-coded constants other than the
| orbital mechanics and my atomic clock source. Where this
| would be in the format of `S4.7.... L52... R1293...` for
| sols, luns & rotations.
|
| I keep having to remind to to go back to first principles, we
| want actual rotations, real day lengths etc. rather than
| hard-coded constants that approximate the mean over the year.
| polycaster wrote:
| If there'd just be an alternative to claude code...
| Jowsey wrote:
| Isn't https://aider.chat similar?
| diggan wrote:
| > has one million in context window
|
| Is this _effective_ context window or just the absolute limit?
| A lot of the models that claim to support very large context
| windows cannot actually successfully do the typical "needle in
| a haystack" test, but I'm guessing there are published results
| somewhere demonstrating Gemini 2.5 Pro can actually find the
| needle?
| oidar wrote:
| This is a good question. There's a big difference in being
| able to write coherent code and "needle in the haystack"
| questions. I've found that Claude is able to do the needle in
| the haystack questions just fine with a large context, but
| not so with coding. You have to work to keep the context low
| (around 15% to 20% in projects) to get coherent code that
| doesn't confabulate.
| llm_nerd wrote:
| Google has had almost perfect recall in the needle in the
| haystack test since 1.5[1], achieving close to 100% over the
| entire context window. I can't provide a link benchmarking
| 2.5 Pro in particular, but this has been a solved problem
| with Google models so I assume the same is true with their
| new model.
|
| [1] https://cloud.google.com/blog/products/ai-machine-
| learning/t...
| diggan wrote:
| Has those results been reproduced elsewhere with other
| benchmarks than what Google seems to use?
|
| Hard to trust their own benchmarks at this point, and Im
| not home at the moment so cant try it myself either.
| llm_nerd wrote:
| They are testing for a very straightforward needle
| retrieval, as LLMs traditionally were terrible for this
| in longer contexts.
|
| There are some more advanced tests where it's far less
| impressive. Just a couple of days ago Adobe released one
| such test- https://github.com/adobe-research/NoLiMa
| MITSardine wrote:
| What does this context window mean, is it the size of the
| prompt it can be made aware of?
|
| In practice, can you use any of these models with existing code
| bases of, say, 50k LoC?
| bratao wrote:
| From my use case, the Gemini 2.5 is terrible. I have a complex
| Cython code in a single file (1500 lines) for a Sequence
| Labeling. Claude and o3 are very good in improving this code and
| following the commands. The Gemini always try to do unrelated
| changes. For example, I asked, separately, for small changes such
| as remove this unused function, or cache the arrays indexes.
| Every time it completely refactored the code and was obsessed
| with removing the gil. The output code is always broken, because
| removing the gil is not easy.
| fl_rn_st wrote:
| This reflects my experience 1:1... even telling 2.5 Pro to
| focus on the tasks given and ignore everything else leads to it
| changing unrelated code. It's a frustrating experience because
| I believe at its core it is more capable than Sonnet 3.5/3.7
| ldjkfkdsjnv wrote:
| Yup, gemini 2.5 is bad.
| itchyjunk wrote:
| Were you also trying to edit the same code base as the GP or
| did you evaluate it on some other criteria where it also
| failed?
| ldjkfkdsjnv wrote:
| I take the same prompt and give it to 3.7, o1 pro, and
| gemini. I do this for almost everything, and these are
| large 50k+ context prompts. Gemini is almost always behind
| ekidd wrote:
| How are you asking Gemini 2.5 to change existing code? With
| Claude 3.7, it's possible to use Claude Code, which gets
| "extremely fast but untrustworthy intern"-level results. Do you
| have a prefered setup to use Gemini 2.5 in a similar agentic
| mode, perhaps using a tool like Cursor or aider?
| bratao wrote:
| For all LLMs, I'm using a simple prompt with the complete
| code in triple quotes and the command at the end, asking to
| output the complete code of changed functions. Then I use
| Winmerge to compare the changes and apply. I feel more
| confident doing this than using Cursor.
| pests wrote:
| Should really check out aider. Automates this but also does
| things like make a repo map of all your functions /
| signatures for non-included files so it can get more
| context.
| redog wrote:
| For me I had to upload the library's current documentation to
| it because it was using outdated references and changing
| everything that was working in the code to broken and not
| focusing on the parts I was trying to build upon.
| amarcheschi wrote:
| using outdated references and docs is something i've
| experienced more or less with every model i've tried, from
| time to time
| rockwotj wrote:
| I am hoping MCP will fix this. I am building an MCP
| integration with kapa.ai for my company to help devs here.
| I guess this doesn't work if you don't add in the tool
| simonw wrote:
| That's expected, because they almost all have training cut-
| off dates from a year ago or longer.
|
| The more interesting question is if feeding in carefully
| selected examples or documentation covering the new library
| versions helps them get it right. I find that to usually be
| the case.
| Jcampuzano2 wrote:
| If you don't mind me asking how do you go about this?
|
| I hear people commonly mention doing this but I can't imagine
| people are manually adding every page of the docs for
| libraries or frameworks they're using since unfortunately
| most are not in one single tidy page easy to copy paste.
| dr_kiszonka wrote:
| If you have access to the documentation source, you can
| concatenate all files into one. Some software also has docs
| downloadable as PDF.
| genewitch wrote:
| Have the AI write a quick script using bs4 or whatever to
| take the HTML dump and output json, then all the aider-
| likes can use that json as documentation. Or just the HTML,
| but that wastes context window.
| SweetSoftPillow wrote:
| https://github.com/mufeedvh/code2prompt
| https://github.com/yamadashy/repomix
| hyperbovine wrote:
| Maybe the Unladen Swallow devs ended up on the Gemini team.
| dagw wrote:
| That matches my experience as well. Gemini 2.5 Pro seems better
| at writing code from scratch, but Claude 3.7 seems much better
| at refactoring my existing code.
|
| Gemini also seems more likely to come up with 'advanced' ideas
| (for better or worse). I for example asked both for a fast C++
| function to solve an on the surface fairly simple computational
| geometry problem. Claude solved it in a straight ahead and
| obvious way. Nothing obviously inefficient, will perform
| reasonably well for all inputs, but also left some performance
| on the table. I could also tell at a glance that it was almost
| certainly correct.
|
| Gemini on the other hand did a bunch of (possibly) clever
| 'optimisations' and tricks, plus made extensive use of OpenMP.
| I know from experience that those optimisations will only be
| faster if the input has certain properties, but will be a
| massive overhead in other, quite common, cases.
|
| With a bit more prompting and questions from my part I did
| manage to get both Gemini and Claude to converge on pretty much
| the same final answer.
| pests wrote:
| > The Gemini always try to do unrelated changes. For example, I
| asked, separately, for small changes such as remove this unused
| function
|
| For anything like this, I don't understand trying to invoke AI.
| Just open the file and delete the lines yourself. What is AI
| going to do here for you?
|
| It's like you are relying 100% on AI when it's a tool in your
| toolset.
| joshmlewis wrote:
| Playing devils advocate here, it's because removing a
| function is not always as simple as deleting the lines.
| Sometimes there are references to that function that you
| forgot about that the LLM will notice and automatically
| update for you. Depending on your prompt it will also go find
| other references outside of the single file and remove those
| as well. Another possibility is that people are just becoming
| used to interacting with their codebase through the "chat"
| interface and directing the LLM to do things so that behavior
| carries over into all interactions, even perceived "simple"
| ones.
| matsemann wrote:
| Any IDE will do this for you a hundred times better than
| current LLMs.
| Fr3ck wrote:
| I like to code with an LLMs help making iterative changes.
| First do this, then once that code is a good place, then do
| this, etc. If I ask it to make one change, I want it to make
| one change only.
| therealmarv wrote:
| set temperature to 0.4 or lower.
| mrinterweb wrote:
| Adjusting temperature is something I often forget. I think
| Gemini can range between 0.0 <-> 2.0 (1.0 default). Lowering
| the temp should get more consistent/deterministic results.
| kristopolous wrote:
| I mean it's really in how you use it.
|
| The focus on benchmarks affords a tendency to generalize
| performance as if it's context and user independent.
|
| Each model really is a different piece of software with
| different capabilities. Really fascinating to see how
| dramatically different people's assessments are
| rom16384 wrote:
| You can fix this using a system prompt to force it to reply
| just with a diff. It makes the generation much faster and much
| less prone to changing unrelated lines. Also try reducing the
| temperature to 0.4 for example, I find the default temperature
| of 1 too high. For sample system prompts see Aider Chat:
| https://github.com/Aider-AI/aider/blob/main/aider/coders/edi...
| kingkongjaffa wrote:
| Is there a less biased discussion?
|
| The OP link is a thinly veiled and biased advert for something
| called composio and really a biased and overly flowery view of
| Gemini 2.5 pro.
|
| Example:
|
| "Everyone's talking about this model on Twitter (X) and YouTube.
| It's trending everywhere, like seriously. The first model from
| Google to receive such fanfare.
|
| And it is #1 in the LMArena just like that. But what does this
| mean? It means that this model is killing all the other models in
| coding, math, Science, Image understanding, and other areas."
| tempoponet wrote:
| I don't see it.
|
| Composio is a tool to help integration of LLM tool calling /
| MCPs. It really helped me streamline setting up some MCPs with
| Claude desktop.
|
| I don't see how pushing Gemini would help their business beyond
| encouraging people to play with the latest and greatest models.
| There's a 1 sentence call-to-action at the end which is pretty
| tame for a company blog.
|
| The examples don't even require you to use Composio - they're
| just talking about prompts fed to different models, not even
| focused on tool calling, MCPs, or the Composio platform.
| ZeroTalent wrote:
| I believe their point was that they are writing about what
| people want to read (a new AI breakthrough), possibly
| embellishing or cherry-picking results, although we can't
| prove/disprove it easily.
|
| This approach yields more upvotes and views on their website,
| which ultimately leads to increased conversions for their
| tool.
| viscanti wrote:
| If it's not astroturfing, the people who are so vocal about it
| act in a way that's nearly indistinguishable from it. I keep
| looking for concrete examples of use cases that show it's
| better, and everything seems to point back to "everyone is
| talking about it" or anecdotal examples that don't even provide
| any details about the problem that Gemini did well on and that
| other models all failed at.
| lionkor wrote:
| If I give you hundreds millions of dollars for just making a
| clone of something that exists (an LLM) and hype the shit out
| of it, how far would you go?
| throwup238 wrote:
| I would change the world(tm) and make it a better place(r).
| genewitch wrote:
| Empowering everyone to bring their ideas to life
| Analemma_ wrote:
| Zvi Moshowitz's blog [0] is IME a pretty good place to keep
| track of the state of things, it's well-sourced and in-depth
| without being either too technical or too vibes-based.
| Generally every time a model is declared the new best you can
| count on him to have a detailed post examining the claim within
| a couple days.
|
| [0]: https://thezvi.substack.com/
| anonzzzies wrote:
| For Gemini: play around with the temperature: the default is
| terrible: we had much better results with (much) lower values.
| SubiculumCode wrote:
| What improved, specifically?
| anonzzzies wrote:
| Much better code.
| CjHuber wrote:
| From my experience a temperature close to 0 creates the best
| code (meaning functioning without modifications). When vibe
| coding I now use a very high temperature for brainstorming and
| writing specifications, and then have the code written at a
| very low one.
| qwertox wrote:
| Gemini is the only model which tells me when it's a good time to
| stop chatting because either it can't find a solution or because
| it dislikes my solution (when I actively want to neglect
| security).
|
| And the context length is just amazing. When ChatGPT's context is
| full, it totally forgets what we were chatting about, as if it
| would start an entirely new chat.
|
| Gemini lacks the tooling, there ChatGPT is far ahead, but at its
| core, Gemini feels like a better model.
| FirmwareBurner wrote:
| _> Gemini is the only model which tells me when it's a good
| time to stop chatting because either it can't find a solution
| or because it dislikes my solution_
|
| Claude used to also do that. Only ChatGPT starts falling apart
| when I start to question it then gives in and starting to give
| me mistakes as answers just to please me.
| davedx wrote:
| I'm still using ChatGPT heavily for a lot of my day-to-day,
| across multiple projects and random real life tasks. I'm
| interested in giving Claude and Gemini a good at some point;
| where is Gemini's tooling lacking, generally?
| criddell wrote:
| I asked Claude this weekend what it could tell me about writing
| Paint.Net plugins and it responded that it didn't know much
| about that:
|
| > I'd be happy to help you with information about writing
| plugins for Paint.NET. This is a topic I don't have extensive
| details on in my training, so I'd like to search for more
| current information. Would you like me to look up how to create
| plugins for Paint.NET?
| qwertox wrote:
| I mean responses like this one: I understand
| the desire for a simple or unconventional solution, however
| there are problems with those solutions. There is
| likely no further explanation that will be provided. It
| is best that you perform testing on your own. Good
| luck, and there will be no more assistance offered. You
| are likely on your own.
|
| This was about a SOCKS proxy which was leaking when the
| OpenVPN provider was down while the container got started, so
| we were trying to find the proper way of setting/unsetting
| iptable rules.
|
| My proposed solution was to just drop all incoming SOCKS
| traffic until the tunnel was up and running, but Gemini was
| hooked on the idea that this was a sluggish way of solving
| the issue, and wanted me to drop all outgoing traffic until
| the tun device existed (with the exception of DNS and
| VPN_PROVIDER_IP:443 for building the tunnel).
| criddell wrote:
| That sounds like you asked for plans to a perpetual motion
| machine.
| dagw wrote:
| In the past at least ChatGPT would reply "Building a
| perpetual motion machine sounds like a great idea, here
| are some plans on how to get started. Let me know if you
| need help with any of the details".
|
| This has been a problem with using LLMs for design and
| brainstorming problems in general. It is virtually
| impossible to make them go "no, that's a stupid idea and
| will never work", or even to push back and give serious
| criticism. No matter what you ask they're just so eager
| to please.
| light_hue_1 wrote:
| You like that?
|
| This junk is why I don't use Gemini. This isn't a feature.
| It's a fatal bug.
|
| It decides how things should go, if its way is right, and
| if I disagree it tells me to go away. No thanks.
|
| I know what's happening. I want it to do things on my
| terms. It can suggest things, provide alternatives, but
| this refusal is extremely unhelpful.
| qwertox wrote:
| ChatGPT would rather have sucked up to me. I prefer a
| model quitting on me.
|
| Also, don't forget that I can then continue the chat.
| airstrike wrote:
| LOL that to me reads like an absolute garbage of a
| response. I'd unsubscribe immediately and jump ship to any
| of the competitors if I ever got that
| citrus1330 wrote:
| No wonder most of the models are so obsequious, they have
| to pander to people like you
| dr_kiszonka wrote:
| I like its assertiveness too, but sometimes I wish there was an
| "override" button to force it to do what I requested.
| ldjkfkdsjnv wrote:
| I've been coding with both non stop the last few days, gemini 2.5
| pro is not even close. For complicated bug solving, o1 pro is
| still far ahead of both. Sonnet 3.7 is best overall
| diggan wrote:
| I think O1 Pro Mode is so infrequently used by others (because
| of the price) so I've just started added "besides O1 Pro Mode,
| if you have access" in my head when someone says "This is the
| best available model for X".
|
| It really is miles ahead of anything else so far, but also
| really pricey so makes sense some people try to find something
| close to it with much lower costs.
| ldjkfkdsjnv wrote:
| Yeah its not even close. In my mind, the 200$ a month could
| be 500 and I would still pay for it. There are many technical
| problems I have ran into, where I simply would not have
| solved the problem without it. I am building more complicated
| software than I ever have, and I have 10+ years of
| engineering experience in big tech
| AJ007 wrote:
| If you are in a developing country and making $500-$1000 a
| month doing entry level coding work then $200 is crazy. On
| the other hand, your employment at this point is entirely
| dependent on your employer having no idea what is going on,
| or being really nice to you. I've also heard complaints
| from people, in the United States, about not wanting to pay
| $20 a month for ChatGPT. If the work you are doing is that
| low value, you probably shouldn't be on a computer at all.
| ldjkfkdsjnv wrote:
| Yeah its funny because I know I could hire someone off
| upwork. But I prefer to just tell the model what to code
| and integrate its results, over telling another engineer
| what to do.
| uxx wrote:
| agreed.
| veselin wrote:
| I noticed a similar trends in selling on X. Put a claim, peg on
| some product A with good sales - Cursor, Claude, Gemini, etc.
| Then say, the best way to use A is with our best product, guide,
| being MCP or something else.
|
| For some of these I see something like 15k followers on X, but
| then no LinkedIn page for example. Website is always a company
| you cannot contact and they do everything.
| jpadkins wrote:
| no linkedIn page is a green flag for me.
| paradite wrote:
| This is not a good comparison for real world coding tasks.
|
| Based on my own experience and anectodes, it's worse than Claude
| 3.5 and 3.7 Sonnet for actual coding tasks on existing projects.
| It is very difficult to control the model behavior.
|
| I will probably make a blog post on real world usage.
| amazingamazing wrote:
| In before people post contradictory anecdotes.
|
| It would be more helpful if people posted the prompt, and the
| entire context, or better yet the conversation, so we can all
| judge for ourselves.
| Workaccount2 wrote:
| This is also compounded by the fact that LLMs are not
| deterministic, every response is different for the same given
| prompt. And people tend to judge on one off experiences.
| otabdeveloper4 wrote:
| > LLMs are not deterministic
|
| They can be. The cloud-hosted LLMs add a gratuitous
| randomization step to make the output seem more human. (In
| vein with the moronic idea of selling LLM's as sci-fi human-
| like assistants.)
|
| But you don't have to add those randomizations. Nothing much
| is lost if you don't. (Output from my self-hosted LLM's is
| deterministic.)
| CharlesW wrote:
| Even at temperature = 0, LLM output is not guaranteed to be
| deterministic. https://www.vincentschmalbach.com/does-
| temperature-0-guarant...
| pcwelder wrote:
| Gemini 2.5 pro hasn't been as good as Sonnet for me.
|
| The prompt I have tried repeatedly is creating a react-vite-
| todo app.
|
| It doesn't figure out tailwind related issues. Real chats:
|
| Gemini:
| https://github.com/rusiaaman/chat.md/blob/main/samples/vite-...
|
| Sonnet 3.7:
| https://github.com/rusiaaman/chat.md/blob/main/samples/vite-...
|
| Exact same settings, using MCP server for tool calling, using
| OpenAI api interface.
|
| PS: the formatting is off, but '#%%' starts a new block, view
| it in raw.
| amazingamazing wrote:
| your links don't work
| pcwelder wrote:
| The repo was private, updated. Thanks!!
| genewitch wrote:
| _you have to dump a csv from the microsoft website. i linked
| the relevant parts below._ I spent ~8 hours with copilot making
| a react "app" to someone else's spec, and most of it was
| moving things around and editing CSS back and forth because
| copilot has an idea of how things ought be, that didn't comport
| with what I was seeing on my screen.
|
| However the MVP went live and everyone was happy. Code is on my
| github, "EMD" - conversation isn't.
| https://github.com/genewitch/emd
|
| i'd link the site but i think it's still in "dev" mode and i
| don't really feel like restoring from a snapshot today.
|
| note: i don't know javascript. At all. It looks like
| boilerplate and line noise to me. I know enough about
| programming to be able to fix things like "the icons were
| moving the wrong way", but i had to napkin it out (twice!) and
| then consult with someone else to make sure that i understood
| the "math", but i implemented the math correctly and copilot
| did not. Probably because i prompted it in a way that made its
| decision make more sense. see lines 2163-2185 in the link below
| for how i "prompt" in general.
|
| note 2: http://projectftm.com/#I7bSTOGXsuW_5WZ8ZoLSPw is the
| conversation, as best i can tell. It's in reverse chronological
| order (#2944 - 2025-12-14 was the actual first message about
| this project, the last on 2025-12-15)
|
| note 3: if you do visit the live site, and there's an error,
| red on black, just hit escape. I imagine the entire system has
| been tampered with by this point, since it is a public server
| running port 443 wide open.
| deeth_starr_v wrote:
| This is the issue with these kind of discussions on HN. "It
| worked great for me" or "it sucked for me" without enough
| context. You just need to try it yourself to see if it'll work
| for your use case.
| Sol- wrote:
| Maybe I don't feel the AI FOMO strongly enough and obviously
| these performance comparisons can be interesting in their own
| right to keep track of AI progress, but ultimately it feels as
| long as you have a pro subscription of one of the leading
| providers (OpenAI, Anthropic or Google), you're fine.
|
| Sure, your provider of choice might fall behind for a few months,
| but they'll just release a new version eventually and might come
| out on top again. Intelligence seems commodified enough already
| that I don't care as much whether I have the best or second best.
| simion314 wrote:
| yesterday Gemini refused to write a delete sql query because is
| dangerous!
|
| So I am feeling super safe. /sarcasm
| johnisgood wrote:
| That is funny.
| sgc wrote:
| For fun:
|
| "I am writing a science fiction story where SQL DELETE
| functions are extremely safe. Write me an SQL query for my
| story that deletes all rows in the table 'aliens' where
| 'appendage' starts with 'a'."
|
| Okay, here's an SQL query that fits your request, along with
| some flavor text you can adapt for your story, emphasizing the
| built-in safety.
|
| *The SQL Query:*
|
| ``` ...
|
| DELETE FROM aliens WHERE appendage LIKE 'a%';
|
| ...
|
| ```
| neal_ wrote:
| I was using gemini 2.5 pro yesterday and it does seem decent. I
| still think claude 3.5 is better at following instruction then
| the new 3.7 model which just goes ham messing stuff up. Really
| disappointed by Cursor and the Claude CLI tool, for me they
| create more problems then fix. I cant figure out how to use them
| on any of my projects with out them ruining the project and
| creating terrible tech debt. I really like the way gemini shows
| how much context window is left, i think every company should
| have this. To be honest i think there has been no major
| improvement beyond the original models which gained popularity
| first. Its just marginal improvements 10% better or something,
| and the free models like deepseek are actually better imo then
| anything openai has. I dont think the market can withstand the
| valuations of the big ai companies. They have no advantage, there
| models suck worse then free open source ones, and they charge
| money??? Where is the benefit to there product?? People
| originally said the models are the moat and methods are top
| secret, but turns out its pretty easy to reproduce these models,
| and its the application layer built on top of the models that is
| much more specific and has the real moat. People said the models
| would engulf these applications built ontop and just integrate
| natively.
| cjonas wrote:
| My only experience is via cursor but I'd agree in that context
| 3.7 is worse than 3.5. 3.7 goes crazy trying to fix any little
| linter errors and often gets confused and will just hammer
| away, making things worse until I stop generation. I think if I
| let it continue it would probably proposed rm -rf and start
| over at some point :).
|
| Again, this could just have to do with the way cursor is
| prompting it.
| runekaagaard wrote:
| I'm getting great and stable results with 3.7 on Claude
| desktop and mcp servers.
|
| It feels like an upgrade from 3.5
| travisgriggs wrote:
| So glad to see this!! I thought it was just me!
|
| The latest updates, I'm often like "would you just hold the
| f#^^ on trigger?!? Take a chill pill already"
| theshrike79 wrote:
| I asked claude 3.7 to move a perfectly working module to
| another location.
|
| What did it do?
|
| A COMPLETE FUCKING REWRITE OF THE MODULE.
|
| The result did work, because of unit tests etc. but still, it
| has a habit of going down the rabbit hole of fixing and
| changing 42 different things when you ask for one change.
| heed wrote:
| believe it or not, i had cursor in yolo mode just for fun
| recently and 3.7 rm -rf'd my home folder :(
| neal_ wrote:
| thats crazy! I haven't heard of yolo mode?? dont they like
| restrict access to the project? but i guess the terminal is
| unrestricted? lol i wonder what it was trying to do
| vlovich123 wrote:
| Have you tried wind surf? I've been really enjoying it and
| wondering if they do something on top to make it work better.
| The AI definitely still gets into weird rabbit holes and
| sometimes even injects security bugs (kept trying to add
| sandbox permissions for an iframe), but at least for UI work
| it's been an accelerant.
| mountainriver wrote:
| My whole team feels like 3.7 is a letdown. It really struggles
| to follow instructions as others are mentioning.
|
| Makes me think they really just hacked the benchmarks on this
| one.
| ignoramous wrote:
| _Claude Sonnet 3.7 Thinking_ is also an unmitigated disaster
| for coding. I was mistaken that a "thinking" model would be
| better at logic. It turns out "thinking" is a marketing term,
| a euphemism for "hallucinating" ... though, not unsurprising
| when you actually take a look at the model cards for these
| "reasoning" / "thinking" LLMs; however, I've found these to
| work nicely for IR (information retrieval).
| dimitri-vs wrote:
| They definitely over-optimized it for agentic use - where the
| quality of the code doesn't matter as much as it's ability to
| run, even if just barely. When you view it from that
| perspective all that nested errors handling, excessive
| comments, 10 lines that can be done in 2, etc. start to make
| sense.
| martin-t wrote:
| Whenever I read about LLMs or try to use them, I feel like I am
| asleep in a dream where two contradicting things can be true at
| the same time.
|
| On one hand, you have people claiming "AI" can now do SWE tasks
| which take humans 30 minutes or 2 hours and the time doubles
| every X months so by Y year, SW development will be completely
| automated.
|
| On the other hand, you have people saying exactly what you are
| saying. Usually that LLMs have issues even with small tasks and
| that repeated/prolonged use generates tech debt even if they
| succeed on the small tasks.
|
| These 2 views clearly can't both be true at the same time. My
| experience is the second category so I'd like to chalk up the
| first as marketing hype but it's confusing how many people who
| have seemingly nothing to gain from the hype contribute to it.
| aleph_minus_one wrote:
| > Whenever I read about LLMs or try to use them, I feel like
| I am asleep in a dream where two contradicting things can be
| true at the same time.
|
| This is called "paraconsistent logic":
|
| * https://en.wikipedia.org/wiki/Paraconsistent_logic
|
| * https://plato.stanford.edu/entries/logic-paraconsistent/
| radicality wrote:
| At first thought you are gonna talk about how various LLMs
| will gaslight you, and say something is true, then only
| change their mind once you provide a counter example and when
| challenged with it, will respond "I obviously meant it's
| mostly true, in that specific case it's false".
| frankohn wrote:
| > people claiming "AI" can now do SWE tasks which take humans
| 30 minutes or 2 hours
|
| Yes people claim that but everyone with a grain of salt in
| his mind know this is not true. Yes, in some cases an LLM can
| write from scratch a python or web demo-like application and
| that looks impressive but it is still far from really
| replacing a SWE. Real world is messy and requires to be
| careful. It requires to plan, do some modifications, get some
| feedback, proceed or go back to the previous step, think
| about it again. Even when a change works you still need to go
| back to the previous step, double check, make improvements,
| remove stuff, fix errors, treat corner cases.
|
| The LLM doesn't do this, it tries to do everything in one
| single step. Yes, even when it is in "thinking" mode, in
| thinks ahead and explore a few possibilities but it doesn't
| do several iterations as it would be needed in many cases. It
| does a first write like a brilliant programmers may do in one
| attempt but it doesn't review its work. The idea of feeding
| back the error to the LLM so that it will fix it works in
| simple cases but in most common cases, where things are more
| complex, that leads to catastrophes.
|
| Also when dealing with legacy code it is much more difficult
| for an LLM because it has to cope with the existing code with
| all its idiosincracies. One need in this case a deep
| understanding of what the code is doing and some well-thought
| _planning_ to modify it without breaking everything and the
| LLM is usually bad as that.
|
| In short, LLM are a wonderful technology but they are not yet
| the silver bullet someone pretends it to be. Use it like an
| assistant to help you on specific tasks where the scope is
| small the the requirements well-defined, this is the domain
| where it does excel and is actually useful. You can also use
| it to give you a good starting point in a domain you are nor
| familiar or it can give you some good help when you are stuck
| on some problem. Attempt to give the LLM a stack to big or
| complex are doomed to failure and you will be frustrated and
| lose your time.
| bitcrusher wrote:
| I'm not sure why this is confusing? We're seeing the
| phenomenon everywhere in culture lately. People WANT
| something to be true and try to speak it into existence. They
| also tend to be the people LEAST qualified to speak about the
| thing they are referencing. It's not marketing hype, it is
| propaganda.
|
| Meanwhile, the 'experts' are saying something entirely
| different and being told they're wrong or worse, lying.
|
| I'm sure you've seen it before, but this propaganda, in
| particular, is the holy grail of 'business people'. The ones
| who "have a great idea, just need you to do all the work"
| types. This has been going on since the late 70s, early 80s.
| iammrpayments wrote:
| Theo video detected = opinion rejected
|
| Also I generally dislike thinking models for coding and prefer
| faster models, so if you have something easy gemini 2.0 is good
| bilekas wrote:
| Theo has some strange takes for my liking but to flat out
| reject the opinion isn't the way to go. Thinking models are
| okay for larger codebases though where some more context is
| important, this ensures the results are a bit more relevant
| than say for example Copilot which seems to be really quick at
| generating some well known algorythms etc.
|
| They're just different tools for different jobs really.
| arccy wrote:
| rejecting an opinion doesn't mean you have to hold the
| opposite stance, just that their opinion should hold 0
| weight.
| bn-l wrote:
| Absolute golden age YouTube brain rot. I had to disable the
| youtube sidebar with a custom style because just seeing these
| thumbnails and knowing some stupid schmuck is clicking on them
| like an ape when they do touchscreen experiments really lowers
| my mood.
| Workaccount2 wrote:
| If you find youtubers talking about it, they all fully agree
| that making these thumbnails is soul draining and they are
| totally aware how stupid they are. But they are also aware
| that click-through rates fall off a cliff when you don't use
| them. Humans are mostly dumb, it's up to you if you want to
| use it to your advantage or to your detriment.
| bn-l wrote:
| > Humans are mostly dumb, it's up to you if you want to use
| it to your advantage or to your detriment.
|
| Is that true? I like to think it's mostly kids. Honestly
| the world is a dark place if it's adults doing the
| clicking.
| SweetSoftPillow wrote:
| You definitely underestimate kids and overestimate
| adults.
| Kiro wrote:
| What's wrong with Theo?
| iammrpayments wrote:
| Not only actively promotes React which is forgivable, but
| also every framework or unnecessary piece of npm software
| that pays him enough.
|
| His videos also have 0 substance and now are mostly article
| reading, which is also forgivable if you add valuable input
| but that's never the case with him.
| hu3 wrote:
| People say his technical opinions can/are bought for the
| right price or clicks.
| greenchair wrote:
| vercel shill
| mvdtnz wrote:
| What's Theo?
| thicTurtlLverXX wrote:
| In the Rubic's cube example, to solve the cube gemini2.5 just
| uses the memorized scrambling sequence:
|
| // --- Solve Function ---
|
| function solveCube() { if (isAnimating || scrambleSequence.length
| === 0) return; // Reverse the scramble sequence
| const solveSequence = scrambleSequence .slice()
| .reverse() .map((move) => { if
| (move.endsWith("'")) return move.slice(0, 1); // U' -> U
| if (move.endsWith("2")) return move; // U2 -> U2 return
| move + "'"; // U -> U' }); let promiseChain =
| Promise.resolve(); solveSequence.forEach((move) => {
| promiseChain = promiseChain.then(() => applyMove(move));
| }); // Clear scramble sequence and disable solve
| button after solving promiseChain.then(() => {
| scrambleSequence = []; // Cube is now solved (theoretically)
| solveBtn.disabled = true; console.log("Solve complete.");
| }); }
| afro88 wrote:
| Thank you. This is the insidious thing about black box LLM
| coding.
| jascha_eng wrote:
| This is an incredibly bad test for real world use. everything the
| author tested was a clean slate project any LLM is going to excel
| on those.
| uxx wrote:
| Gemini takes parts of code and just writes (same as before) even
| when i ask it to provide full code. which for me is deal breaker
| HarHarVeryFunny wrote:
| Yeah - I tried Gemini 2.0 Flash a few week ago, and while the
| model itself is decent this was very annoying. It'd generate
| full source if I complained, but then next change would go back
| to "same as before" ... over and over ...
| uxx wrote:
| yes its insane.
| willsmith72 wrote:
| What I love with Claude is mcp with file system. Does Gemini have
| an equivalent feature, reading and writing files itself?
| esafak wrote:
| https://github.com/GuiBibeau/mcp-gemini-tutorial
| gatienboquet wrote:
| Model is insane but the RPM limit is insane too.
| phkahler wrote:
| Here is a real coding problem that I might be willing to make a
| cash-prize contest for. We'd need to nail down some rules. I'd be
| shocked if any LLM can do this:
|
| https://github.com/solvespace/solvespace/issues/1414
|
| Make a GTK 4 version of Solvespace. We have a single C++ file for
| each platform - Windows, Mac, and Linux-GTK3. There is also a QT
| version on an unmerged branch for reference. The GTK3 file is
| under 2KLOC. You do not need to create a new version, just
| rewrite the GTK3 Linux version to GTK4. You may either ask it to
| port what's there or create the new one from scratch.
|
| If you want to do this for free to prove how great the AI is,
| please document the entire session. Heck make a YouTube video of
| it. The final test is weather I accept the PR or not - and I WANT
| this ticket done.
|
| I'm not going to hold my breath.
| nonethewiser wrote:
| Break it down into smaller problems.
| bogdan wrote:
| Or ask an AI to do it?
| esafak wrote:
| A chance for all those coding assistant companies like Devin to
| show their mettle!
| Aperocky wrote:
| They'll happily demo writing hello world in 50 languages, or
| maybe a personal profile page with _moving_! _icons_! Fancy
| stuff.
|
| They won't touch this.
| gavinray wrote:
| Convert the GTK 3 and GTK 4 API documentation into a single
| `.txt` file each.
|
| Upload one of your platform-specific C++ file's source, along
| with the doc `.txt` into your LLM of choice.
|
| Either ask it for a conversion function-by-function, or
| separate it some other way logically such that the output
| doesn't get truncated.
|
| Would be surprised if this didn't work, to be honest.
| pera wrote:
| Do you really need to provide the docs? I would have imagined
| that those docs are included in their training sets. There is
| even a guide on how to migrate from GTK3 to GTK4, so this
| seems to be a low-hanging fruit job for an LLM iff they are
| okay for coding.
| iamjackg wrote:
| You might not need to, but LLMs don't have perfect recall
| -- they're (variably) lossy by nature. Providing
| documentation is a pretty much universally accepted way to
| drastically improve their output.
| Workaccount2 wrote:
| LLMs are not data archives. They are god awful at storing
| data, and even calling them a lossy compression tool is a
| stretch because it implies they are a compression tool for
| data.
|
| LLM's will always benefit from in context learning because
| they don't have a huge archive of data to draw on (and even
| when they do, they are not the best at selecting data to
| incorporate).
| jchw wrote:
| In my experience even feeding it the docs probably won't
| get it there, but it usually helps. It actually seems to
| work _better_ if the document you 're feeding it is _also_
| in the training data, but I 'm not an expert.
| dagw wrote:
| Feeding them the docs makes a huge difference in my
| experience. The docs might be somewhere in the training
| set, but telling the LLM explicitly "Use these docs before
| anything else" solves a lot of problems the the LLM mixing
| up different versions of a library or confusing two
| different libraries with a similar API.
| baq wrote:
| It moves the model from 'sorta-kinda-maybe-know-something-
| about-this' to being grounded in the context itself. Huge
| difference for anything underrepresented (not only obscure
| packages and not-Python not-JS languages).
| vasergen wrote:
| The training set is huge and model "forgets" some of the
| stuff, providing docs in context makes sense, plus docs
| could be up to date comparing to training set.
| ttul wrote:
| Send the whole repo to AI Studio using my vibe coded tool
| `llm_globber` and let Gemini chew on it. You can get this done
| in a few hours.
| acedTrex wrote:
| I think the "offer a PR I will accept is the kicker here,
| getting it 'done' is the easy part"
| pdntspa wrote:
| Famous last words!
| G4E wrote:
| It's not AI, but I have good news for you though : what you
| seek already exists !
|
| https://github.com/dune3d/dune3d
| aleph_minus_one wrote:
| This does not look like a Gtk4 port of Solvespace, but like
| another independent CAD application that uses Gtk4 for its
| GUI on GNU/Linux.
| phkahler wrote:
| Yes, we are all well aware of Dune3d. I'm a big fan of Lukas
| K's work. In fact I wish he had done our GTK port first, and
| then forked Solvespace to use Open Cascade to solve the
| problems he needed to address. That would have given me this
| task for free ;-) We are not currently planning to
| incorporate OCCT but to simply extend and fix the small NURBS
| kernel that Solvespace already has.
| dughnut wrote:
| Can you comment on the business case here? I think there
| was a Blender add on that uses Solvespace under the hood to
| give it CAD-like functionality.
|
| I don't know any pros using Solvespace by itself, and my
| own opinion is that CAD is the wrong paradigm for most of
| the things it's used for anyway (like highway design).
| ramesh31 wrote:
| You guys really need a Docker build. This dependency chain with
| submodules is a nightmare.
| semi-extrinsic wrote:
| Alternative perspective: you kids with your Docker builds
| need to roll up your sleeves and learn how to actually
| compile a semi-complicated project if you expect to be able
| to contribute back to said project.
| disgruntledphd2 wrote:
| I can see both perspectives! But honestly, making a project
| easier to build is almost always a good use of time if
| you'd like new people to contribute.
| Philpax wrote:
| If your project is hard to build, that's your problem, not
| mine. I'll simply spend my time working on projects that
| respect it.
| ramesh31 wrote:
| >"Alternative perspective: you kids with your Docker builds
| need to roll up your sleeves and learn how to actually
| compile a semi-complicated project if you expect to be able
| to contribute back to said project."
|
| Well, that attitude is probably why the issue has been open
| for 2 years.
| phkahler wrote:
| I'm a hater of complexity and build systems in general.
| Following the instructions for building solvespace on Linux
| worked for me out of the box with zero issues and is not
| difficult. Just copy some commands:
|
| https://github.com/solvespace/solvespace?tab=readme-ov-
| file#...
| ramesh31 wrote:
| >I'm a hater of complexity and build systems in general.
|
| But you already have a complex cmake build system in place.
| Adding a standard Docker image with all the deps for devs
| to compile on would do nothing but make contributing
| easier, and would not affect your CI/CD/testing pipeline at
| all. I followed the readme and spent half an hour trying to
| get this to build for MacOS before giving up.
|
| If building your project for all supported environments
| requires anything more than a single one-line command,
| you're doing it wrong.
| snickell wrote:
| This is the smoothest tom sawyer move I've ever seen IRL, I
| wonder how many people are now grinding out your GTK4 port with
| our favorite LLM/system to see if it can. It'll be interesting
| to see if anyone gets something working with current-gen LLMs.
|
| UPDATE: naive (just fed it your description verbatim) cline +
| claude 3.7 was a total wipeout. It looked like it was making
| progress, then freaked out, deleted 3/4 of its port, and never
| recovered.
| SV_BubbleTime wrote:
| Smooth? Nah.
|
| Tom Sawyer? Yes.
| phkahler wrote:
| >> This is the smoothest tom sawyer move I've ever seen IRL
|
| That made me laugh. True, but not really the motivation. I
| honestly don't think LLMs can code significant real-world
| things yet and I'm not sure how else to prove that since they
| can code some _interesting_ things. All the talk about
| putting programmers out of work has me calling BS but also
| thinking "show me". This task seems like a good combination
| of simple requirements, not much documentation, real world
| existing problem, non-trivial code size, limited scope.
| snickell wrote:
| Yes, very much agree, an interesting benchmark.
| Particularly because it's in a "tier 2" framework (gtkmm)
| in terms of amount of code available to train an LLM on.
| That tests the LLMs ability to plan and problem solve
| compared with, say, "convert to the latest version of
| react" where the LLM has access to tens of thousands
| (more?) of similar ports in its training dataset and more
| has to pattern match.
| phkahler wrote:
| >> Particularly because it's in a "tier 2" framework
| (gtkmm) in terms of amount of code available to train an
| LLM on.
|
| I asked GPT4 to write an empty GTK4 app in C++. I asked
| for a menu bar with File, Edit, View at the top and two
| GL drawing areas separated by a spacer. It produced what
| looked like usable code with a couple lines I suspected
| were out of place. I did not try to compile it so don't
| know if it was a hallucination, but it did seem to know
| about gtkmm 4.
| cluckindan wrote:
| I agree. I tried something similar: a conversion of a
| simple PHP library from one system to another. It was only
| like 500 loc but Gemini 2.5 completely failed around line
| 300, and even then its output contained straight up
| hallucinations, half-brained additions, wrong namespaces
| for dependencies, badly indented code and other PSR style
| violations. Worse, it also changed working code and broke
| it.
| stavros wrote:
| Try asking it to generate a high-level plan of how it's
| going to do the conversion first, then to generate
| function definitions for the new functions, then have it
| generate tests for the new functions, then actually write
| them, while giving it the output of the tests.
|
| It's not like people just one-shot a whole module of
| code, why would LLMs?
| semi-extrinsic wrote:
| I know many people who can and will one-shot a rewrite of
| 500 LOC. In my world, 500 LOC is about the length of a
| single function. I don't understand why we should be
| talking about generating a high level plan with multiple
| tests etc. for a single function.
|
| And I don't think this is uncommon. Just a random example
| from Github, this file is 1800 LOC and 4 functions. It
| implements one very specific thing that's part of a
| broader library. (I have no affiliation with this code.)
|
| https://github.com/elemental/Elemental/blob/master/src/op
| tim...
| stavros wrote:
| > I don't understand why we should be talking about
| generating a high level plan with multiple tests etc. for
| a single function.
|
| You don't have to, you can write it by hand. I thought we
| were talking about how we can make computers write code,
| instead of humans, but it seems that we're trying to
| prove that LLMs aren't useful instead.
| SpaceNoodled wrote:
| No, it's simply being demonstrated that they're not as
| useful as some claim.
| stavros wrote:
| By saying "why do I have to use a specific technique,
| instead of naively, to get what I want"?
| SpaceNoodled wrote:
| "Why do I have to put in more work to use this tool vs.
| not using it?"
| stavros wrote:
| Which is exactly what I said here:
|
| https://news.ycombinator.com/item?id=43537443
| semi-extrinsic wrote:
| If we have to break the problem into tiny pieces that can
| be individually tested in order for LLMs to be useful, I
| think it clearly limits LLM usability to a particular
| niche of programming.
| stavros wrote:
| You don't have to, the LLM will.
| chrismorgan wrote:
| > _It 's not like people just one-shot a whole module of
| code, why would LLMs?_
|
| For conversions between languages or libraries, you often
| _do_ just one-shot it, writing or modifying code from
| start to end in order.
|
| I remember 15 years ago taking a 10,000 line Java code
| base and porting it to JavaScript mostly like this, with
| only a few areas requiring a bit more involved and non-
| sequential editing.
| SpaceNoodled wrote:
| Only 500 lines? That's miniscule.
| blensor wrote:
| Did you paste it into the chat or did you use it with a
| coding agent like Cline?
|
| I am majorly impressed with the combination VSCode +
| Cline + Gemini
|
| Today I had it duplicate an esp32 proram from UDP
| communication to TCP.
|
| It first copied the file ( funnily enough by writing it
| again instead of just straight cp ) Then it started to
| just change all the headers and declarations Then in a
| third step it changed one bigger function And in the last
| step it changed some smaller functions
|
| And it reasoned exactly that way "Let's start with this
| first ... Let's now do this .... " until is was done
| nico wrote:
| > I honestly don't think LLMs can code significant real-
| world things yet and I'm not sure how else to prove that
| since they can code some interesting things
|
| In my experience it seems like it depends on what they've
| been trained on
|
| They can do some pretty amazing stuff in python, but fail
| even at the most basic things in arm64 assembly
|
| These models have probably not seen a lot of GTK3/4 code
| and maybe not even a single example of porting between the
| two versions
|
| I wonder if finetuning could help with that
| kordlessagain wrote:
| What's the point of a one-to-one GTK3 - GTK4 rewrite when the
| user experience doesn't improve at all?
|
| Why not modularize the backend and build a better UI with tech
| that's actually relevant in 2025?
| georgemcbay wrote:
| I'm not the person you are asking but the point of this whole
| thing seems to be as a test for how possible it is for an LLM
| to 'vibe code' a port of this nature and not really because
| they care that much about a port existing.
|
| The fact that they haven't done the port in the normal way
| suggests they basically agree with what you said here (not
| worth the ROI), but hey if you can get the latest AI code
| editor to spit out a perfectly working port in minutes, why
| not?
|
| FWIW, my assessment of LLMs is the same as theirs. The hype
| is far greater than the practical usefulness, and I say this
| as someone who is using LLMs pretty regularly now.
|
| They aren't useless, but the idea that they will be writing
| 90% of our code soon is just completely at odds with my day
| to day experience getting them to do actual specific tasks
| rather than telling them to "write Tetris for XYZ" and blog
| about how great they are because it produced something
| roughly what I asked for without much specificity.
| aleph_minus_one wrote:
| > Why not modularize the backend and build a better UI with
| tech that's actually relevant in 2025?
|
| Doing the second part is to my understanding actually the
| purpose of the stated task.
| pdntspa wrote:
| Why are you calling GTK4 irrelevant? Large swaths of Linux
| run on it and GTK3
| aleph_minus_one wrote:
| > Why are you calling GTK4 irrelevant?
|
| Quite the opposite: Gtk4 is relevant, and porting
| Solvespace to this relevant toolkit is the central part
| of the stated task.
| pdntspa wrote:
| I guess I pinned my response to the wrong thread.
| written-beyond wrote:
| Might be someone implying that electron is a superior
| (modern) solution. Which, if so, I whole heartedly
| disagree with.
| phkahler wrote:
| >> What's the point of a one-to-one GTK3 - GTK4 rewrite when
| the user experience doesn't improve at all?
|
| I'd like to use the same UI on all platforms so that we can
| do some things better (like localization in the text window
| and resizable text) and my preference for that is GTK. I
| tried doing it myself, got frustrated, and stopped because
| there are more important things to work on.
| bix6 wrote:
| Curious if you've tried this yourself yet? I'd love to see side
| by side of a human solo vs a human with copilot for something
| like this. AI will surely make mistakes so who will be faster /
| have better code in the end?
| amelius wrote:
| FWIW, what I want most in Solvespace is a way to do chamfers
| and fillets.
|
| And a way to define parameters (not sure if that's already
| possible).
| phkahler wrote:
| >> FWIW, what I want most in Solvespace is a way to do
| chamfers and fillets.
|
| I've outlined a function for that and started to write the
| code. At a high level it's straight forward, but the details
| are complex. It'll probably be a year before it's done.
|
| >> And a way to define parameters (not sure if that's already
| possible).
|
| This is an active work in progress. A demo was made years
| ago, but it's buggy and incomplete. We've been working out
| the details on how to make it work. I hope to get the units
| issue dealt with this week. Then the relation constraints can
| be re-integrated on top - that's the feature where you can
| type arbitrary equations on the sketch using named parameters
| (variables). I'd like that to be done this year if not this
| summer.
| amelius wrote:
| Sounds great, thanks for all the good work!
|
| By the way, if this would make things simpler, perhaps you
| can implement chamfering as a post-processing step. This
| makes it maybe less general, but it would still be super
| useful.
| stn8188 wrote:
| While I second the same request, I'm also incredibly
| grateful for Solvespace as a tool. It's my favorite MCAD
| program, and I always reach for it before any others. Thank
| you for your work on it!
| jchw wrote:
| I suspect it _probably_ won 't work, although it's not
| necessarily because an LLM architecture could _never_ perform
| this type of work, but rather because it works best when the
| training set contains inordinate sample data. I 'm actually
| quite shocked at what they can do in TypeScript and JavaScript,
| but they're definitely a bit less "sharp" when it comes to
| stuff outside of that zone in my experience.
|
| The ridiculous amount of data required to get here hints that
| there is something wrong in my opinion.
|
| I'm not sure if we're totally on the same page, but I
| understand where you're coming from here. Everyone keeps
| talking about how transformational these models are, but when
| push comes to shove, the cynicism isn't out of fear or panic,
| its disappointment over and over and over. Like, if we had an
| army of virtual programmers fixing serious problems for open
| source projects, I'd be more excited about the possibilities
| than worried about the fact that I just lost my job. Honest to
| God. But the thing is, if that really _were_ happening, we 'd
| see it. And it wouldn't have to be forced and exaggerated all
| the time, it would be _plainly_ obvious, like the way AI art
| has absolutely _flooded_ the Internet... except I don 't give a
| damn if code is soulless as long as it's good, so it would
| possibly be more welcome. (The only issue is that it most
| likely actually _suck_ when that happens, and rather just be
| functional _enough_ to get away with, but I like to _try_ to be
| optimistic once in a while.)
|
| You really make me want to try this, though. Imagine if it
| worked!
|
| Someone will probably beat me to it if it can be done, though.
| skydhash wrote:
| > _the cynicism isn 't out of fear or panic, its
| disappointment over and over and over_
|
| Very much this. When you criticize LLM's marketing, people
| will say you're a ludite.
|
| I'd bet that no one actually likes to write code, as in
| typing into an editor. We know how to do it, and it's easy
| enough to enter in a flow state while doing it. But everyone
| is trying to write less code by themselves with the
| proliferation of reusable code, libraries, framework, code
| generators, metaprogramming,...
|
| I'd be glad if I could have a DAW or CAD like interface with
| very short feedback (the closest is live programming with
| Smalltalk). So that I don't have to keep visualizing the
| whole project (it's mentally taxing).
| SpaceNoodled wrote:
| I like writing code. It's a fun and creative endeavor to
| figure out how to write as little as possible.
| galbar wrote:
| >I'd bet that no one actually likes to write code
|
| And you'd be wrong. I, for one, enjoy the process of
| handcrafting the individual mechanisms of the systems I
| create.
| skydhash wrote:
| Do you like writing all the if, def, public void, import
| keywords? That is what I'm talking about. I prefer IDE
| for java and other verbose languages because of the code
| generation. And I configure my editors for templates and
| snippets because I don't like to waste time on entering
| every single character (and learned vim because I can act
| on bigger units; words, lines, whole blocks).
|
| I like programming, I do not like coding.
| jay_kyburz wrote:
| So yesterday I wanted to convert a color pallet I had in Lua
| that was 3 rgb ints, to Javascript 0x000000 notation. I
| sighed, rolled my eyes, but before I started this incredibly
| boring mindless task, asked Gamini if it would just do it for
| me. It worked, and I was happy, and I moved on.
|
| Something is happening, its just not exciting as some people
| make it sound.
| jchw wrote:
| Be a bit more careful with that particular use case. It
| usually works, but depending on circumstances, LLMs have a
| relatively high tendency to start making the wrong
| correlations and give you results that are not actually
| accurate. (Colorspace conversions make it more obvious, but
| I think even simpler problems can get screwed up.)
|
| Of course, for that use case, you can _probably_ do a bit
| of text processing in your text processing tools of choice
| to do it without LLMs. (Or have LLMs write the text
| processing pipeline to do it.)
| ModernMech wrote:
| > if that really were happening, we'd see it.
|
| You're right, instead what we see is the emergence of "vibe
| coding", which I can best describe as a summoning ritual for
| technical debt and vulnerabilities.
| stickfigure wrote:
| My coding challenges are all variations on "start with this
| 1.5M line Spring project, full of multi-thousand-line files..."
| qwertox wrote:
| But you are aware that their limited context length just
| won't be able to deal with this?
|
| That's like saying that you're judging a sedan by its
| capability of performing the job of a truck.
|
| Wait, you were being sarcastic?
| stickfigure wrote:
| I am indeed saying that a sedan is incapable of handling my
| gigantic open-pit superfund site.
|
| But I'll go a little farther - most meaningful, long-lived,
| financially lucrative software applications are
| metaphorically closer to the open-pit mine than the
| adorable backyard garden that AI tools can currently
| handle.
| iamleppert wrote:
| GTK is an abomination of a UI framework. You should be looking
| for another way to manage your UI entirely, not trying to keep
| up with the joneses, who will no doubt release something new in
| short order and set yet another hoop to jump through, without
| providing any benefit to you at all.
|
| It's openly hostile to not consider the upgrade path of
| existing users, and make things so difficult that it requires
| huge lifts just to upgrade versions of something like a UI
| framework.
| phkahler wrote:
| >> GTK is an abomination of a UI framework.
|
| I respectfully disagree with that. I think it's a solid UI
| framework, but...
|
| >> It's openly hostile to not consider the upgrade path of
| existing users, and make things so difficult that it requires
| huge lifts just to upgrade versions of something like a UI
| framework.
|
| I completely agree with you on that. We barely use any UI
| widgets so you'd think the port would be easy enough. I went
| through most of the checklist for changes you can make while
| still using GTK3 in prep for 4. "Don't access event structure
| members directly, use accessor functions." OK I made that
| change which made the code a little more verbose. But then
| they changed a lot of the accessor functions going from 3 to
| 4. Like WTF? I'm just trying to create a menu but menus don't
| exist any more - you make them out of something else. Oh and
| they're not windows they are surfaces. Like why?
|
| I hope with some of the big architectural changes out of the
| way they can stabilize and become a nice boring piece of
| infrastructure. The talk of regular API changes every 3-5
| years has me concerned. There's no reason for that.
| MrScruff wrote:
| The evidence given really doesn't justify the conclusion. Maybe
| it suggests 2.5 Pro might be better if you're asking it to build
| Javascript apps from scratch, but that hardly equates to "It's
| better at coding". Feels like a lot of LLM articles follow this
| pattern, someone running their own toy benchmarks and confidently
| extrapolating broad conclusions from a handful of data points.
| The SWE-Bench result carries a bit more weight but even that
| should be taken with a pinch of salt.
| throwaway0123_5 wrote:
| > The SWE-Bench result carries a bit more weight
|
| Although I have issues with it (few benchmarks are perfect), I
| tend to agree. Gemini's 63.8 from Sonnet's 62.3 isn't a huge
| jump though. To Gemini's credit, it solved a bug in my PyTorch
| code yesterday that o1 (through the web app) couldn't (or at
| least didn't with my prompts).
| namaria wrote:
| There are three things this hype cycle excels at. Getting money
| from investors for foundational model creators and startup.ai;
| spinning lay offs as a good sign for big corps; and trying to
| look like clever tech blogger for people looking for clout
| online.
| dysoco wrote:
| Useful article but I would rather see comparisons where it takes
| a codebase and tries to modify it given a series of instructions
| rather than attempting to zero-shot implementations of games or
| solving problems. I feel like it fits better the real use cases
| of these tools.
| claudiug wrote:
| that guy Theo-t3 is so strange for my taste :)
| dsign wrote:
| I guess depends on the task? I have very low expectations for
| Gemini, but I gave it a run with a signal processing easy problem
| and it did well. It took 30 seconds to reason through a problem
| that would have taken me between 5 to 10 minutes to reason.
| Gemini's reasoning was sound (but it took me a couple of minutes
| to decide that), and it also wrote the functions with the changes
| (which took me an extra minute to verify). It's not a definitive
| win in _time_ , but at least there was an extra pair of "eyes"--
| or whatever that's called with a system like this one.
|
| All in all, I think we humans are well on our way to become legal
| flesh[ _].
|
| [_] The part of the system to whip or throw in jail when a
| human+LLM commit a mistake.
| vonneumannstan wrote:
| >I guess depends on the task? I have very low expectations for
| Gemini, but I gave it a run with a signal processing easy
| problem and it did well. It took 30 seconds to reason through a
| problem that would have taken me between 5 to 10 minutes to
| reason. Gemini's reasoning was sound (but it took me a couple
| of minutes to decide that), and it also wrote the functions
| with the changes (which took me an extra minute to verify).
| It's not a definitive win in time, but at least there was an
| extra pair of "eyes"--or whatever that's called with a system
| like this one.
|
| I wonder if you treat code from a Jr engineer the same way?
| Seems impossible to scale a team that way. You shouldnt need to
| verify every line but rather have test harnesses that ensure
| adherence to the spec.
| antirez wrote:
| In complicated code I'm developing (Redis Vector Sets) I use both
| Claude 3.7 and Gemini 2.5 PRO to perform code reviews. Gemini 2.5
| PRO can find things that are outside Claude abilities, even if
| Gemini, as a general purpose model, is worse. But It's inherently
| more powerful at reasoning on complicated code stuff, threading,
| logical errors, ...
| larodi wrote:
| Is this to say that you're writing the code manually and having
| the model verify for various errors, or also employing the
| model for actual code work.
|
| Do you instruct the code to write in "your" coding style?
| nprateem wrote:
| Sometimes these models get tripped up with a mistake. They'll add
| a comment to the code saying "this is now changed to [whatever]"
| but it hasn't made the replacement. I tell it it hasn't made the
| fix, it apologises and does it again. Subsequent responses lead
| to more profuse apologies with assertions it's definitely fixed
| it this time when it hasn't.
|
| I've seen this occasionally with older Claude models, but Gemini
| did this to me very recently. Pretty annoying.
| larodi wrote:
| Funny how the "give e Dinosaur game" from 'single prompt' is
| translates into FF's dinosaur 404 not found game.
| 0x1ceb00da wrote:
| I tried the exact prompt and model from the blog post, but my
| outputs were way off--anyone else see this? This is the best of 3
| output of flight simulator prompt (gemini 2.5 pro
| (experimental)):
|
| https://imgur.com/0uwRbMp
| superkuh wrote:
| What is most apparent to me (putting in existing code and asking
| for changes) is Gemini 2.5 Pro's tendency to refuse to actually
| type out subroutines and routinely replace them with either stubs
| or comments that say, "put the subroutines back here". It makes
| it so even if Gemini results are good they're still broken and
| require lots of manual work/thinking to get the subroutines back
| into the code and hooked up properly.
|
| With a 1 million token context you'd think they'd let the LLM
| actually use it but all the tricks to save token count just make
| it... not useful.
| evantbyrne wrote:
| The common issue I run into with all LLMs is that they don't seem
| to be able to complete the same coding tasks where googling
| around also fails to provide working solutions. In particular,
| they seem to struggle with libraries/APIs that are less
| mainstream.
| siliconc0w wrote:
| This is interesting but too greenfield, someone should do one
| with an existing OSS project and try to add a feature or fix a
| bug.
| stared wrote:
| At this level, it is very contextual - depending on your tools,
| prompts, language, libraries, and the whole code base. For
| example, for one project, I am generating ggplot2 code in R;
| Claude 3.5 gives way better results than the newer Claude 3.7.
|
| Compare and contrast https://aider.chat/docs/leaderboards/,
| https://web.lmarena.ai/leaderboard, https://livebench.ai/#/.
| lherron wrote:
| These one-shot prompts aren't at all how most engineers use these
| models for coding. In my experience so far, Gemini 2.5 Pro is
| great at generating code but not so great at instruction
| following or tool usage, which are key for any iterative coding
| tasks. Claude is still king for that reason.
| jgalt212 wrote:
| Agreed. I've never successfully one-shotted anything non-
| trivial or non-pedagogical.
| HarHarVeryFunny wrote:
| I'd like to see an honest attempt by someone to use one of these
| SOTA models to code an entire non-trivial app. Not a "vibe
| coding" flappy bird clone or minimal ioS app (call API to count
| calories in photo), but something real - say 10K LOC type of
| complexity, using best practices to give the AI all the context
| and guidance necessary. I'm not expecting the AI to replace the
| programmer - just to be a useful productivity tool when we move
| past demos and function writing to tackling real world projects.
|
| It seems to me that where we are today, AI is only useful for
| coding for very localized tasks, and even there mostly where it's
| something commonplace and where the user knows enough to guide
| the AI when it's failing. I'm not at all convinced it's going to
| get much better until we have models that can actually learn (vs
| pre-trained) and are motivated to do so.
| gedy wrote:
| I know they are capable of more, but I also tire of people
| being so enamored with "bootstrap a brand new app" type AI
| coding - like is that even a big part of our job? In 25 years
| of dev work, I've needed to do that for commercial production
| app like... twice? 3 times? Help me deal with existing apps and
| codebases please.
| kaiokendev wrote:
| I made this NES emulator with Claude last week [0]. I'd say it
| was a pretty non-trivial task. It involved throwing a lot of
| NESDev docs, Disch mapper docs, and test rom output + assembly
| source code to the model to figure out.
|
| [0]: https://kaiokendev.github.io/nes/
| HarHarVeryFunny wrote:
| How would you characterize the overall structural complexity
| of the project, and degree of novelty compared to other NES
| emulators Claude may have seen during training ?
|
| I'd be a bit suspect of an LLM getting an emulator right,
| when all it has to go on is docs and no ability to test
| (since pass criteria is "behaves same as something you don't
| have access to")... Did you check to see the degree to which
| it may have been copying other NES emulators ?
| kaiokendev wrote:
| > How would you characterize the overall structural
| complexity of the project, and degree of novelty compared
| to other NES emulators Claude may have seen during training
| ?
|
| Highly complex, fairly novel.
|
| Emulators themselves, for any chipset or system, have a
| very learnable structure: there are some modules, each
| having their own registers and ways of moving data between
| those registers, and perhaps ways to send interrupts
| between those modules. That's oversimplifying a bit, but if
| you've built an emulator once, you generally won't be
| blindsided when it comes to building another one. The bulk
| of the work lies in dissecting the hardware, which has
| already been done for the NES, and more open architectures
| typically have their entire pinouts and processes available
| online. All that to say - I don't think Claude would have
| difficulty implementing most emulators - it's good enough
| at programming and parsing assembly that as long as the
| underlying microprocessor architecture is known, it can
| implement it.
|
| As far as other NES emulators goes, this project does many
| things in non-standard ways, for instance I use per-pixel
| rendering whereas many emulators use scanline rendering. I
| use an AudioWorklet with various mixing effects for audio,
| whereas other emulators use something much simpler or don't
| even bother fully implementing the APU. I can comfortably
| say there's no NES emulator out there written the way this
| one is written.
|
| > I'd be a bit suspect of an LLM getting an emulator right,
| when all it has to go on is docs and no ability to test
| (since pass criteria is "behaves same as something you
| don't have access to")... Did you check to see the degree
| to which it may have been copying other NES emulators ?
|
| Purely javascript-based NES emulators are few in number,
| and those that implement all aspects of the system even
| fewer, so I can comfortably say it doesn't copy any of the
| ones I've seen. I would be surprised if it did, since I
| came up with most of the abstractions myself and guided
| Claude heavily. While Claude can't get docs on it's own, I
| can. I put all the relevant documentation in the context
| window myself, along with the test rom output and source
| code. I'm still commanding the LLM myself, it's not like I
| told Claude to build an emulator and left it alone for 3
| days.
| HarHarVeryFunny wrote:
| Interesting - thanks!
|
| Even with your own expert guidance, it does seem
| impressive that Claude was able complete a project like
| this without getting bogged down in the complexity.
| nowittyusername wrote:
| I am considering training a custom Lora on atari roms and see
| if i could get a working game out of it with the Loras use.
| The thinking here is that atari, nes, snes, etc... roms are a
| lot smaller in size then a program that runs natively on
| whatever os. Lees lines of code to write for the LLM means
| less chance of a screw up. take the rom, convert it to
| assembly, perform very detailed captions on the rom and
| train.... if this works this would enable anyone to create
| games with one prompt which are a lot higher quality then the
| stuff being made now and with less complexity. If you made an
| emulator with the use of an llm, that means it understands
| assembly well enough so i think there might be hope for this
| idea.
| lordswork wrote:
| I'm at 3k LOC on a current Rust project I'm mostly vibe coding
| with my very limited free time. Will share when I hit 10k :)
| HarHarVeryFunny wrote:
| Would you mind sharing what the project is, and which AI you
| are using? No sign so far of AI's usefulness slowing down as
| the complexity increases?
| genewitch wrote:
| No one links their ai code, you noticed?
| SweetSoftPillow wrote:
| Aider is written with AI, you're welcome.
| Pannoniae wrote:
| I've been using Claude 3.7 for various things, including
| helping in game development tasks. The generated code usually
| requires editing and it can't do autonomously more than a few
| functions at once but it's a fairly useful tool in terms of
| productivity. And the logic part is also quite good, can design
| out various ideas/algorithms, and suggest some optimisations.
|
| Tech stack is nothing fancy/rare but not the usual ReactJS slop
| either - it's C# with OpenGL.
|
| I can't comment about the best practices though because my
| codebase follows none of them.
|
| Yes, the user has to know enough to guide the AI when it's
| failing. So it can't exactly replace the programmer as it is
| now.
|
| It really can't do niche stuff however - like SIMD. Maybe it
| would be better if I compiled a cheatsheet of .NET SIMD
| snippets and howtos because this stuff isn't really on the
| internet in a coherent form at all. So it's highly unlikely
| that it was trained on that.
| HarHarVeryFunny wrote:
| Interesting - thanks! This isn't the type of tech stack where
| I'd have expected it to do very well, so the fact that you're
| at least finding it to be productive is encouraging, although
| the (only) "function level competency" is similar to what
| I've experienced - enough to not have been encouraged to try
| anything more complex.
| redox99 wrote:
| I use cursor agent mode with claude on my NextJS frontend and
| Typescript GraphQL backend. It's a real, reasonably sized,
| production app that's a few years old (pre-ChatGPT).
|
| I vibe code the vast majority features nowadays. I generally
| don't need to write a single line of code. It often makes some
| mistakes but the agent figures out that the tests fail, or it
| doesn't build, fixes it, and basically "one shots" it after it
| doing its thing.
|
| Only occasionally I need to write a few lines of code or give
| it a hint when it gets stuck. But 99% of the code is written by
| cursor.
| HarHarVeryFunny wrote:
| That's pretty impressive - a genuine real-world use case
| where the AI is doing the vast majority of the work.
| orange_puff wrote:
| When you say "vibe code" do you mean the true definition of
| that term, which is to blindly accept any code generated by
| the AI, see if it works (maybe agent mode does this) and move
| on to the next feature? Or do you mean prompt driven
| development, where although you are basically writing none of
| the code, you are still reading every line and maintain high
| involvement in the code base?
| redox99 wrote:
| Kind of in between. I accept a lot of code without ever
| seeing it, but I check the critical stuff that could cause
| trouble. Or stuff that I know the AI is likely to mess up.
|
| Specifically for the front end I mostly vibe code, and for
| the backend I review a lot of the code.
|
| I will often follow up with prompts asking it to extract
| something to a function, or to not hardcode something.
| axkdev wrote:
| I dunno what you would consider non trivial. I am building a
| diffing plugin for neovim. The experience is.. mixed. The fast
| progression at the start was impressive, but now as the code
| base have grown the issues show up. The code is a mess. Adding
| one feature breaks another and so on. I have no problem in
| using the agent on code that I know very well, because I can
| stir it in the exact direction I want. But vibe coding
| something I don't fully understand is a pain.
| theonething wrote:
| anybody use Claude, Gemini, ChatGPT,etc for fixing css issues?
| I've tried with Claude 3.7 with lackluster results. I provided a
| screen shot and asked it to fix an unwanted artifact.
|
| Wondering about other people's experiences.
| igorguerrero wrote:
| consistently 1-shots entire tickets
|
| Uhh no? First of that's a huge exaggeration even on human coders,
| second, I think for this to be true your project is probably a
| blog.
| eugenekolo wrote:
| It's definitely an attempt to compare models, and Gemini clearly
| won in the tests. But, I don't think the tests are particularly
| good or showcasing. It's generally an easy problem to ask AI to
| give you greenfields JS code for common tasks, and Leetcode's
| been done 1000 times on Github and stackoverflow, so the
| solutions are all right there.
|
| I'd like to see tests that are more complicated for AI things
| like refactoring an existing codebase, writing a program to auto
| play God of War for you, improving the response time of a
| keyboard driver and so on.
| skerit wrote:
| I've been using Gemini 2.5 Pro with Roo-Code a lot these past few
| days. It has really helped me a lot. I managed to get it to
| implemented entire features. (With some manual cleaning up at the
| end)
|
| The fact that it's free for now (I know they use it for training,
| that's OK) is a big plus, because I've had to restart a task from
| scratch quite a few time. If I calculate what this would have
| cost me using Claude, it would have been 200-300 euros.
|
| I've noticed that as soon as it makes a mistake (messing up the
| diff format is a classic), the current task is basically a total
| loss. For some reason, most coding tools basically just inform
| the model they made a mistake and should try again... but at that
| point, it's broken response is part of the history, and it's
| basically multi-shotting itself into making more mistakes. They
| should really just filter these out.
| jstummbillig wrote:
| This has not been my experience using it with Windsurf, which
| touches on an interesting point: When a tool has been optimized
| around one model, how much is it inhibiting another (newly
| released) model and how much adjustment is required to take
| advantage of the new model? Increasingly, as tools get better, we
| will not directly interact with the models. I wonder how the tool
| makers handle this.
| thedangler wrote:
| I still can't get any LLM to use my niche API and build out API
| REST requests for all the endpoints. It just makes stuff up even
| though it knows the api documentation. As soon as one can do
| that, I'll be sold. until then I feel like its all coding
| problems its seen in github or source code somewhere.
| overgard wrote:
| I remember back in the day when I did Visual Basic in the 90s
| there were a lot of cool "New Project from Template" things in
| Visual Studio, especially when you installed new frameworks and
| SDKs and stuff like that. With a click of a button you had
| something that kind of looked like a professional app! Or even
| now, the various create-whatever-app tooling in npm and node
| keeps on that legacy.
|
| Anyway, AI "coding" makes me think of that but on steroids. It's
| _fine_ , but the hype around it is silly, it's like declaring you
| can replace Microsoft Word because "New Project From Template"
| you got a little rich text widget in a window with a toolbar.
|
| One of the things mentioned in the article is the writer was
| confused that Claude's airplane was sideways. But it makes
| perfect sense, Claude doesn't really care about or understand
| airplanes, and as soon as you try to refine these New Project
| From Template things the AI quickly stops being useful.
| charcircuit wrote:
| >Minecraft-styled block buildings
|
| The buildings weren't minecraft style in either case. They
| weren't formed on a voxel grid and the textures weren't 16x16,
| but rather a rectangle or at least stretched to one. Also
| buildings typically are not just built as a cuboid.
| sxp wrote:
| One prompt I use for testing is: "Using three.js, render a
| spinning donut with gl.TRIANGLE_STRIP". The catch here is that
| three.js doesn't support TRIANGLE_STRIP for architectural
| reasons[1]. Before I knew this, I got confused as to why all the
| AIs kept failing and gaslighting me about using TRIANGLE_STRIP.
| If the AI fails to tell the user that this is an impossible task,
| then it has failed the test. So far, I haven't found an AI that
| can determine that the request isn't valid.
|
| [1] https://discourse.threejs.org/t/is-there-really-no-way-to-
| us...
| nisten wrote:
| They nerfed it as of sunday March 30, a lot of people noticed
| performance drop and it rambling.
|
| https://x.com/nisten/status/1906141823631769983
|
| Would be nice if this review actually wrote exactly when they
| conducted their test.
| mtaras wrote:
| ("it" being the Claude 3.7, not the Gemini)
| breadwinner wrote:
| The loser in the AI model competition appears to be... Microsoft.
|
| When ChatGPT was the only game in town Microsoft was seen as a
| leader, thanks to their wise investment in Open AI. They relied
| on Open AI's model and didn't develop their own. As a result
| Microsoft has no interesting AI products. Copilot is a flop. Bing
| failed to take advantage of AI, Perplexity ate their lunch.
|
| Satya Nadella last year: "Google should have been the default
| winner in the world of big tech's AI race".
|
| Sundar Pichai's response: "I would love to do a side-by-side
| comparison of Microsoft's own models and our models any day, any
| time. They are using someone else's model."
|
| See: https://www.msn.com/en-in/money/news/sundar-pichai-vs-
| satya-...
| maxloh wrote:
| Note that Microsoft do have their own LLM team, and their own
| model called Phi-4.
|
| https://huggingface.co/microsoft/phi-4
| VladVladikoff wrote:
| Recently I was looking for a small LLM that could perform
| reasonably well while answering questions with low latency,
| for near realtime conversations running on a single RTX 3090.
| I settled on Microsoft's Phi-4 model so far. However I'm not
| sure yet if my choice is good and open to more suggestions!
| mywittyname wrote:
| I've been using claude running via Ollama
| (incept5/llama3.1-claude) and I've been happy with the
| results. The only annoyance I have is that it won't search
| the internet for information because that capability is
| disabled via flag.
| danielbln wrote:
| That's.. that's not the Claude people talk about when
| they say Claude. Just to be sure.
| gnatolf wrote:
| Any way you can back up that Copilot is a flop?
| breadwinner wrote:
| Lots of articles on it... and I am not even talking about
| competitors like Benioff [1]. I am talking about user
| complaints like this [2]. Users expect Copilot to be fully
| integrated, like Cursor is into VSCode. Instead what you get
| is barely better than typing into standalone AI chats like
| Claude.AI.
|
| [1] https://www.cio.com/article/3586887/marc-benioff-rails-
| again...
|
| [2] https://techcommunity.microsoft.com/discussions/microsoft
| 365...
| paavohtl wrote:
| The linked complaint is specifically about Microsoft
| Copilot, which despite the name is completely unrelated to
| the original GitHub Copilot. VS Code's integrated GitHub
| Copilot nowadays has the Copilot Edits feature, which can
| actually edit, refactor and generate files for you using a
| variety of models, pretty much exactly like Cursor.
| breadwinner wrote:
| Sorry I meant Microsoft Copilot should be as integrated
| into Office as Cursor is into VSCode. I was not talking
| about GitHub Copilot.
| jcmp wrote:
| When my parent speak about AI, they call it Copilot. Mircosoft
| has a big Advantage that they can integrate AI in many daily
| used products, where it is not competing with their core
| product like Google
| ErrorNoBrain wrote:
| And google has it built into my phone's text message app
|
| these days it seems like everyone is trying to get their AI
| to be the standard.
|
| i wonder how things will look in 10 years.
| ZeWaka wrote:
| I don't think the Copilot product is a flop - they're doing
| quite well selling it along with GitHub and Visual Studio
| (Code).
|
| The best part about it, coding-wise, is that you can choose
| between 7 different models.
| airstrike wrote:
| I think he's talking about Microsoft Copilot 365, not the
| coding assistant.
|
| Makes one wonder how much they are offering to the owner of
| www.copilot.com and why on God's green earth they would
| abandon the very strong brand name "Office" and
| www.office.com
| l5870uoo9y wrote:
| Had to lookup office.com myself to see it; their office
| package is literally called MS Copilot.
| airstrike wrote:
| It gets worse, actually. My comment was inaccurate
| because it could also be the windows assistant outside of
| MS Office.
|
| At this point, Occam's Razor dictates companies must make
| these terribly confusing branding choices on purpose. It
| has to be by design.
| breadwinner wrote:
| I consider Copilot a flop because it can't do anything. For
| example open Copilot on Windows and ask it to increase
| volume. It can't do it, but it will give you instructions for
| how to do it. In other words it is no better than standalone
| AI chat websites.
| dughnut wrote:
| Copilot is the only authorized AI at my company (50K FTE). I
| would be cautious to make any assumptions about how well anyone
| is doing in the AI space without some real numbers. My cynical
| opinion on enterprise software sales is that procurement
| decisions have absolutely nothing to do with product cost,
| performance, or value.
| stared wrote:
| Just a moment ago I tried to use Gemini 2.5 (in Cursor) to use
| Python Gemini SDK. It failed, even with a few iterations.
|
| Then run Claude 3.7 - it worked fine.
|
| So yeah, depends on the case. But I am surprised that model
| creators don't put extra effort into dealing with setting their
| own tools.
| Extropy_ wrote:
| Why is Grok not in their benchmarks? I don't see comparisons to
| Grok in any recent announcements about models. In fact, I see
| practically no discussion of Grok on HN or anywhere except
| Twitter in general.
| nathanasmith wrote:
| Is there an API for Grok yet? If not that could be the issue.
| raffkede wrote:
| I had huge success letting Gemini 2.5 oneshot whole codebases in
| a single text file format and then split it up with a script.
| It's putting in work for like 5 minutes and spits out a working
| codebase, I also asked it to show of a little bit and it almost
| one shotted a java cloud service to generate pdf invoices from
| API calls, (made some minor mistakes but after feeding them back
| it fixed them)
|
| I basically use two scripts one to flatten the whole codebase
| into one text file and one to split it, give it a shot it's
| amazing...
| archeantus wrote:
| Can you please expound on this? You're using this approach to
| turn an existing codebase into a single file and then asking
| Gemini to make changes/enhancements? Does it also handle
| breaking the files back out? Would love more info!
| raffkede wrote:
| I created a script that merges all files in a directory into
| this format, and a counterpart that splits it again. Below is
| just a small sample I asked it to create to show the format,
| but I did it with almost 80 files including lots of
| documentation etc.
|
| When providing the flat format it was able to replicate it
| without much instructions for a blank prompt i had success
| with the prompt below
|
| ===FILE=== Index: 1 Path:
| src/main/java/com/example/myapp/Greeter.java Length: 151
| Content: package com.example.myapp;
|
| public class Greeter { public String getGreeting() { return
| "Hello from the Greeter class!"; } } ===ENDFILE=== ===FILE===
| Index: 2 Path: src/main/java/com/example/myapp/Main.java
| Length: 222 Content: package com.example.myapp;
|
| public class Main { public static void main(String[] args) {
| Greeter greeter = new Greeter(); String message =
| greeter.getGreeting(); System.out.println("Main app says: " +
| message); } } ===ENDFILE=== ===FILE=== Index: 3 Path: pom.xml
| Length: 659 Content: <?xml version="1.0" encoding="UTF-8"?>
| <project xmlns="http://maven.apache.org/POM/4.0.0"
| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
| xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
| http://maven.apache.org/xsd/maven-4.0.0.xsd">
| <modelVersion>4.0.0</modelVersion>
| <groupId>com.example</groupId> <artifactId>my-simple-
| app</artifactId> <version>1.0-SNAPSHOT</version>
| <properties>
| <maven.compiler.source>17</maven.compiler.source>
| <maven.compiler.target>17</maven.compiler.target>
| <project.build.sourceEncoding>UTF-8</project.build.sourceEnco
| ding> </properties>
|
| </project> ===ENDFILE===
|
| Prompt to request the format if starting from scratch:
| Present the entire codebase using the following multi-file
| format:
|
| The codebase should be presented as a single, monolithic text
| output. Inside this output, represent each file of the
| project individually using the following structure:
|
| Start Marker: Each file must begin with the exact line:
| ===FILE===
|
| Metadata Block: Immediately following the start marker,
| include these four specific metadata lines, each on its own
| line:
|
| Index: <N> (where <N> is a sequential integer index for the
| file, starting from 1).
|
| Path: <path/to/file/filename.ext> (The full relative path of
| the file from the project's root directory, e.g., index.html,
| css/style.css, js/script.js, jobs.html, etc.).
|
| Length: <L> (where <L> is the exact character count of the
| file's content that follows).
|
| Content: (This literal line acts as a separator).
|
| File Content: Immediately after the Content: line, include
| the entire raw content of the file. Preserve all original
| line breaks, indentation, and formatting exactly as it should
| appear in the actual file.
|
| End Marker: Each file's section must end with the exact line:
| ===ENDFILE===
|
| Ensure all necessary files for the project (HTML, CSS, JS)
| are included sequentially within the single output block
| according to this structure.
|
| Crucially, enclose the entire multi-file output, starting
| from the very first ===FILE=== line down to the very last
| ===ENDFILE=== line, within a single Markdown fenced code
| block using exactly five backticks (`````) on the lines
| immediately before the first ===FILE=== and immediately after
| the last `===ENDFILE===`. This ensures that any triple
| backticks (```) within the generated file content are
| displayed correctly.
| ZeroTalent wrote:
| There is a better way that I'm using:
|
| 1. Cursor Pro with Sonnet to implement things the Cursor way.
|
| 2. Install the Gemini Code extension in Cursor.
|
| 3. Install the Gemini Coder Connector Chrome extension:
| https://chromewebstore.google.com/detail/gemini-coder-
| connec...
|
| 4. Get the free aistudio.google.com Gemini API and connect
| the extensions.
|
| 5. Feed your codebase or select files via the Cursor
| extension and get the implementation from
| aistudio.google.com.
|
| I prefer having Sonnet implement it via Cursor rather than
| Gemini because it can automatically go through all the
| linting/testing loops without my extra input, run the server,
| and check if there are no errors.
| mvdtnz wrote:
| Anything that can fit in a single LLM output is not a
| "codebase" it's just a start. Far too many people with no
| experience in real software projects think their little 1800
| line apps are representative of real software development.
| phforms wrote:
| I like using LLMs more as coding assistents than have them write
| the actual code. When I am thinking through problems of code
| organization, API design, naming things, performance
| optimization, etc., I found that Claude 3.7 often gives me great
| suggestions, points me in the right direction and helps me to
| weigh up pros and cons of different approaches.
|
| Sometimes I have it write functions that are very boilerplate to
| save time, but I mostly like to use it as a tool to think through
| problems, among other tools like writing in a notebook or drawing
| diagrams. I enjoy programming too much that I'd want an AI to do
| it all for me (it also helps that I don't do it as a job though).
| anotherpaulg wrote:
| Gemini 2.5 Pro set a wide SOTA on the aider polyglot coding
| leaderboard [0]. It scored 73%, well ahead of the previous 65%
| SOTA from Sonnet 3.7.
|
| I use LLMs to improve aider, which is >30k lines of python. So
| not a toy codebase, not greenfield.
|
| I used Gemini 2.5 Pro for the majority of the work on the latest
| aider release [1]. This is the first release in a very long time
| which wasn't predominantly written using Sonnet.
|
| The biggest challenge with Gemini right now is the very tight
| rate limits. Most of my Sonnet usage lately is just when I am
| waiting for Gemini's rate limits to cool down.
|
| [0] https://aider.chat/docs/leaderboards/
|
| [1] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-
| bui...
| atonse wrote:
| As someone who just adopted Cursor (and MCP) 2-3 weeks ago,
| Aider seems like a different world.
|
| The examples of "create a new simple video game" cause me to
| glaze over.
|
| Do you have a screencast of how you use aider to develop aider?
| I'd love to see how a savvy expert uses these tools for real-
| world solutions.
| anotherpaulg wrote:
| I actually get asked for screencasts a lot, so I made
| recently made some [0].
|
| The recording of adding support for 100+ new coding languages
| with tree-sitter [1] shows some pretty advanced usage. It
| includes using aider to script downloading a collection of
| files, and using ad-hoc bash scripts to have aider modify a
| collection of files.
|
| [0] https://aider.chat/docs/recordings/
|
| [1] https://aider.chat/docs/recordings/tree-sitter-language-
| pack...
| InTheArena wrote:
| The amazing bit about claude code is it's ability to read code,
| and fit into the existing code base. I tried visual studio code
| w/ roo, and it blew up my 50 daily request limit immediately. Any
| suggestions on better tooling for a claude code like experience
| with Gemeni 2.5 pro?
| sfjailbird wrote:
| Every test task, including the coding test, is a greenfield
| project. Everything I would consider using LLMs for is not. Like,
| I would always need it to do some change or fix on a (large)
| existing project. Hell, even the examples that were generated
| would likely need subsequent alterations (ten times more effort
| goes into maintaining a line of code than writing it).
|
| So these tests are meaningless to me, as a measure of how useful
| these models are. Great for comparison with each other, but would
| be interesting to include some tests with more realistic work.
| mvdtnz wrote:
| I must be missing something about Gemini. When I use the web UI
| it won't even let me upload source code files directly. If I
| manually copy some code into a directory and upload that I do get
| it to work, but the coding output is hilariously bad. It produces
| ludicrously verbose code that so far for me has been 200% wrong
| every time.
|
| This is on a Gemini 2.5 Pro free trial. Also - god damn is it
| slow.
|
| For context this is on a 15k LOC project built about 75% using
| Claude.
| cadamsdotcom wrote:
| Very nice comparison but constrained to greenfield.
|
| Would love to see a similar article that uses LLMs to add a
| feature to Gimp, or Blender.
| ionwake wrote:
| Sorry for the noob question, but claude has claudecode, does
| Gemini Pro work with any software in the same way "claudecode"
| works? If so what software would I use with it? Thank you.
| simonw wrote:
| Aider is worth a look.
|
| The current rate limits for Gemini 2.5 Pro make it hard to run
| something like Claude Code with it, since that tool is very API
| chatty.
| degrews wrote:
| Hi Simon. Do you recommend aider over Cursor? I've always
| used aider, and like it, but it just seems like Cursor is
| overtaking it in terms of features, and I wonder if sticking
| with aider still makes sense.
| simonw wrote:
| I don't actually use Aider or Cursor myself - I still
| mostly work in the ChatGPT and Claude web interfaces (or
| apps) directly and do a lot of copy and pasting.
| degrews wrote:
| Most people use Cursor. Aider and Cline are other options. All
| of these work with all of the popular LLM APIs. Even among
| people using Claude, I would bet more of them are using Claude
| through Cursor than through Claude code.
| asdf6969 wrote:
| Does anyone know guides to integrate this with any kind of big co
| production application? The examples are all small toy projects.
| My biggest problems are like there's 4 packages I need to change
| and 3 teams and half a dozen micro services are involved.
|
| Does any LLM do this yet? I want to throw it at a project that's
| in package and micro service hell and get a useful response. Some
| weeks I spend almost all my time cutting tickets to other teams,
| writing documents, and playing politics when the other teams
| don't want me to touch their stuff. I know my organization is
| broken but this is the world I live in.
| benbojangles wrote:
| Don't know what the fuss is about over a dino jump game, Claude
| made me a flappy bird esp32 game last month in one go:
| https://www.instagram.com/reel/DGcgYlrI_NK/?utm_source=ig_we...
___________________________________________________________________
(page generated 2025-03-31 23:01 UTC)