[HN Gopher] Measuring the impact of AI on experienced open-sourc...
       ___________________________________________________________________
        
       Measuring the impact of AI on experienced open-source developer
       productivity
        
       Author : dheerajvs
       Score  : 442 points
       Date   : 2025-07-10 16:29 UTC (6 hours ago)
        
 (HTM) web link (metr.org)
 (TXT) w3m dump (metr.org)
        
       | Jabrov wrote:
       | Very interesting methodology, but the sample size (16) is way too
       | low. Would love to see this repeated with more participants.
        
         | IshKebab wrote:
         | They paid the developers about $75k in total to do this so I
         | wouldn't hold your breath!
        
           | barbazoo wrote:
            | That's a lot of money for many of us. Do you know if those
            | folks were in an HCOL area?
        
             | IshKebab wrote:
             | No idea. They don't say who they were; just random popular
             | GitHub projects.
             | 
             | To be clear it wasn't $75k _each_.
        
               | narush wrote:
               | You can see a list of repositories with participating
               | developers in the appendix! Section G.7.
               | 
                | Paper is here:
                | https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
        
             | mapt wrote:
             | It isn't a lot of money for industry research. Changes of
             | +-40% in productivity are an enormous
             | advantage/disadvantage for a large tech company moving tens
             | of billions of dollars a year in cashflow through a
             | pipeline that their software engineers built.
        
           | lawlessone wrote:
           | Neat, how to sign up??
        
             | IshKebab wrote:
             | Go back in time, create a popular github repo with lots of
             | stars, be lucky.
        
             | asdff wrote:
              | I see these things posted on LinkedIn, usually offering
              | $40/hr though. But it's essentially the same thing as the
              | OP outlines: you do some domain-related task, assigned
              | either with or without an AI tool. Check LinkedIn. They
              | will have really vague titles like "data scientist", even
              | though that's not what is being described; it's being a
              | study subject. Maybe set $40/hr as a filter on LinkedIn
              | and see if you can get a few to come up.
        
         | narush wrote:
         | Noting that most of our power comes from the number of tasks
         | that developers complete; it's 246 total completed issues in
         | the course of this study -- developers do about 15 issues (7.5
         | with AI and 7.5 without AI) on average.
        
           | biophysboy wrote:
           | Did you compare the variance within individuals (due to
           | treatment) to the variance between individuals (due to other
           | stuff)?
        
       | kokanee wrote:
       | > developers expected AI to speed them up by 24%, and even after
       | experiencing the slowdown, they still believed AI had sped them
       | up by 20%.
       | 
       | I feel like there are two challenges causing this. One is that
       | it's difficult to get good data on how long the same person in
       | the same context would have taken to do a task without AI vs
       | with. The other is that it's tempting to time an AI with metrics
       | like how long until the PR was opened or merged. But the AI
       | workflow fundamentally shifts engineering hours so that a greater
       | percentage of time is spent on refactoring, testing, and
       | resolving issues later in the process, including after the code
       | was initially approved and merged. I can see how it's easy for a
       | developer to report that AI completed a task quickly because the
       | PR was opened quickly, discounting the amount of future work that
       | the PR created.
        
         | qsort wrote:
         | It's really hard to attribute productivity gains/losses to
         | specific technologies or practices, I'm very wary of self-
         | reported anecdotes in any direction precisely because it's so
         | easy to fool ourselves.
         | 
         | I'm not making any claim in either direction, the authors
         | themselves recognize the study's limitations, I'm just trying
         | to say that everyone should have far greater error bars. This
         | technology is the weirdest shit I've seen in my lifetime,
         | making deductions about productivity from anecdotes and dubious
         | benchmarks is basically reading tea leaves.
        
         | yorwba wrote:
         | Figure 21 shows that initial implementation time (which I take
         | to be time to PR) increased as well, although post-review time
         | increased even more (but doesn't seem to have a significant
         | impact on the total).
         | 
         | But Figure 18 shows that time spent actively coding decreased
         | (which might be where the feeling of a speed-up was coming
         | from) and the gains were eaten up by time spent prompting,
         | waiting for and then reviewing the AI output and generally
         | being idle.
         | 
         | So maybe it's not a good idea to use LLMs for tasks that you
         | could've done yourself in under 5 minutes.
        
         | narush wrote:
          | Qualitatively, we don't see a drop in PR quality between the
          | AI-allowed and AI-disallowed conditions in the study; the devs
          | who participate are generally excellent, know their
          | repositories' standards super well, and aren't really into the
          | 'get up a bad
         | PR' vibe -- the median review time on the PRs in the study is
         | about a minute.
         | 
          | Developers do spend time totally differently, though -- this
         | is a great callout! On page 10 of the paper [1], you can see a
         | breakdown of how developers spend time when they have AI vs.
         | not - in general, when these devs have AI, they spend a smaller
         | % of time writing code, and a larger % of time working with AI
         | (which... makes sense).
         | 
         | [1]
         | https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
        
         | gitremote wrote:
         | > I feel like there are two challenges causing this. One is
         | that it's difficult to get good data on how long the same
         | person in the same context would have taken to do a task
         | without AI vs with.
         | 
         | The standard experimental design that solves this is to
         | _randomly_ assign participants to the experiment group (with
         | AI) and the control group (without AI), which is what they did.
         | This isolates the variable (with or without AI), taking into
         | account uncontrollable individual, context, and environmental
          | differences. You don't need to know how the single individual
          | and context would have behaved in the other group. With a
          | large enough sample size and effect size, you can determine
          | statistical significance and attribute the difference to the
          | with-or-without-AI variable.
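          | 
          | A toy sketch of those mechanics, with made-up completion times
          | (note this study actually randomized per issue within each
          | developer), just to illustrate random assignment plus a simple
          | permutation test for significance:
          | 
          |   import java.util.Random;
          | 
          |   public class RandomizedComparison {
          |     // Fisher-Yates shuffle of the condition labels.
          |     static void shuffle(boolean[] a, Random rng) {
          |       for (int i = a.length - 1; i > 0; i--) {
          |         int j = rng.nextInt(i + 1);
          |         boolean tmp = a[i]; a[i] = a[j]; a[j] = tmp;
          |       }
          |     }
          | 
          |     // Mean "with AI" time minus mean "without AI" time.
          |     static double meanDiff(double[] hrs, boolean[] ai) {
          |       double aiSum = 0, noSum = 0;
          |       int aiN = 0, noN = 0;
          |       for (int i = 0; i < hrs.length; i++) {
          |         if (ai[i]) { aiSum += hrs[i]; aiN++; }
          |         else       { noSum += hrs[i]; noN++; }
          |       }
          |       return aiSum / aiN - noSum / noN;
          |     }
          | 
          |     public static void main(String[] args) {
          |       Random rng = new Random(0);
          |       // Made-up completion times in hours, one per task.
          |       double[] hrs = {1.5, 2.0, 0.8, 3.2, 1.1, 2.7, 0.9, 1.8};
          | 
          |       // Random assignment: half the tasks get AI, half don't,
          |       // and the shuffle (not the individual or the context)
          |       // decides which task lands in which condition.
          |       boolean[] ai = new boolean[hrs.length];
          |       for (int i = 0; i < ai.length; i++) {
          |         ai[i] = i < ai.length / 2;
          |       }
          |       shuffle(ai, rng);
          | 
          |       double observed = meanDiff(hrs, ai);
          | 
          |       // Permutation test: how often does reshuffling the
          |       // labels produce a difference at least as large as the
          |       // observed one by chance alone?
          |       int extreme = 0, trials = 10_000;
          |       for (int t = 0; t < trials; t++) {
          |         shuffle(ai, rng);
          |         if (Math.abs(meanDiff(hrs, ai)) >= Math.abs(observed)) {
          |           extreme++;
          |         }
          |       }
          |       System.out.printf("diff = %.2f h, p ~= %.3f%n",
          |           observed, (double) extreme / trials);
          |     }
          |   }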
        
       | dash2 wrote:
       | The authors say "High developer familiarity with repositories" is
       | a likely reason for the surprising negative result, so I wonder
       | if this generalizes beyond that.
        
         | kennywinker wrote:
         | Like if it generalizes to situations where the developer is not
         | familiar with the repo? That doesn't seem like generalizing,
         | that seems like specifying. Am I wrong in saying that the
         | majority of developer time is spent in repos that they're
         | familiar with? Every job and project I've worked has been on a
         | fixed set of repos the entire time. If AI is only helpful for
         | the first week or two on a project, that's not very many cases
         | it's useful for.
        
           | jbeninger wrote:
           | I'd say I write the majority of my code in areas I'm familiar
           | with, but spend the majority of my _time_ on sections I'm not
           | familiar with, and ai helps a lot more with the latter than
           | the former. I've always felt my coding life is speeding
           | through a hundred lines of easy code then getting stuck on
           | the 101st. Then as I get more experienced that hundred
           | becomes 150, then 200, but always speeding through the easy
           | part until I have to learn something new.
           | 
           | So I never feel like I'm getting any faster. 90% of my time
           | is still spent in frustration, even when I'm producing twice
           | the code at higher quality
        
         | add-sub-mul-div wrote:
         | Without the familiarity would the work be getting done
         | effectively? What does it mean for someone to commit AI code
         | that they can't fully understand?
        
       | noisy_boy wrote:
       | It is 80/20 again - it gets you 80% of the way in 20% of the time
       | and then you spend 80% of the time to get the rest of the 20%
       | done. And since it always feels like it is almost there, sunk-
       | cost fallacy comes into play as well and you just don't want to
       | give up.
       | 
        | An approach that I tried recently is to use it as a
       | friction remover instead of a solution provider. I do the
       | programming but use it to remove pebbles such as that small bit
       | of syntax I forgot, basically to keep up the velocity. However, I
       | don't look at the wholesale code it offers. I think keeping the
       | active thinking cap on results in code I actually understand
       | while avoiding skill atrophy.
        
         | wmeredith wrote:
         | > and then you spend 80% of the time to get the rest of the 20%
         | done
         | 
          | This was my pre-AI experience anyway, so getting that first
         | chunk of time back is helpful.
         | 
         | Related: One of the better takes I've seen on AI from an
         | experienced developer was, "90% of my skills just became
         | worthless, and the other 10% just became 1,000 times more
          | valuable." There's some hyperbole there, but I like the gist.
        
           | skydhash wrote:
           | It's not funny when you find yourself redoing the first 80%,
           | as the only way to complete the second 80%.
        
           | bluefirebrand wrote:
           | Let us know if that dev you're talking about winds up working
           | 90% less for the same amount, or earning 1000x more
           | 
           | Otherwise he can shut the fuck up about being 1000x more
           | valuable imo
        
         | emodendroket wrote:
         | I think it's most useful when you basically need Stack Overflow
         | on steroids: I basically know what I want to do but I'm not
         | sure how to achieve it using this environment. It can also be
         | helpful for debugging and rubber ducking generally.
        
           | threetonesun wrote:
           | Absolutely this. For a while I was working with a language I
           | was only partially familiar with, and I'd say "here's how I
           | would do this in [primary language], rewrite it in [new
           | language]" and I'd get a decent piece of code back. A little
           | searching in the project to make sure it was stylistically
           | correct and then done.
        
           | some-guy wrote:
           | All those things are true, but it's such a small part of my
           | workflow at this point that the savings, while nice, aren't
            | nearly as life-changing to my job as my CEO is forcing us to
            | think they are.
           | 
            | Until AI can actually untangle our 14-year-old codebase full
            | of hodge-podge code and read every commit message, JIRA
            | ticket, and Slack conversation related to the changes in
            | full context, it's not going to solve a lot of the hard
            | problems at my job.
        
           | skydhash wrote:
           | The issue is that it is slow and verbose, at least in its
            | default configuration. The amount of reading is non-trivial.
           | There's a reason most references are dense.
        
             | lukan wrote:
             | Those issues you can partly solve by changing the prompt to
              | tell it to be concise and not explain its code.
             | 
             | But nothing will make them stick to the one API version I
             | use.
        
               | diggan wrote:
               | > But nothing will make them stick to the one API version
               | I use.
               | 
               | Models trained for tool use can do that. When I use Codex
                | for some Rust stuff, for example, it can grep the source
                | files in the directory where dependencies are stored, so
                | looking up the current APIs is trivial for them. Same
               | works for JavaScript and a bunch of other languages too,
               | as long as it's accessible somewhere via the tools they
               | have available.
        
               | lukan wrote:
                | Hm, I haven't tried Codex so far, but I have tried quite
                | a few other tools and models, and none could help me in
                | a consistent way. I am sceptical, because even if I tell
                | them explicitly to only use one specific version, they
                | might or might not use that, depending on their training
                | corpus and temperature, I assume.
        
               | malfist wrote:
               | The less verbosity you allow the dumber the LLM is. It
               | thinks in tokens and if you keep it from using tokens
               | it's lobotomized.
        
           | GuinansEyebrows wrote:
           | > rubber ducking
           | 
           | i don't mean to pick on your usage of this specifically, but
           | i think it's noteworthy that the colloquial definition of
           | "rubber ducking" seems to have expanded to include "using a
           | software tool to generate advice/confirm hunches". I always
           | understood the term to mean a personal process of talking
           | through a problem out loud in order to methodically,
           | explicitly understand a theoretical plan/process and expose
           | gaps.
           | 
           | based on a lot of articles/studies i've seen (admittedly
           | haven't dug into them too deeply) it seems like the use of
           | chatbots to perform this type of task actually has negative
           | cognitive impacts on some groups of users - the opposite of
           | the personal value i thought rubber-ducking was supposed to
           | provide.
        
         | eknkc wrote:
         | It works great on adding stuff to an already established
         | codebase. Things like "we have these search parameters, also
         | add foo". Remove anything related to x...
        
           | antonvs wrote:
           | Exactly. If you can give it a contract and a context,
           | essentially, and it doesn't need to write a large amount of
           | code to fulfill it, it can be great.
           | 
           | I just used it to write about 80 lines of new code like that,
           | and there's no question it saves time.
        
         | reverendsteveii wrote:
         | well we used to have a sort of inverse pareto where 80% of the
         | work took 80% of the effort and the remaining 20% of the work
         | also took 80% of the effort.
         | 
         | I do think you're onto something with getting pebbles out of
         | the road inasmuch as once I know what I need to do AI coding
         | makes the doing _much_ faster. Just yesterday I was playing
         | around with removing things from a List object using the Java
          | streams API and I kept running into
          | ConcurrentModificationExceptions, which happen when the list
          | is structurally modified while it's still being iterated, so
          | the iterator can't guarantee it's working from an unaltered
          | view of the list. I spent about an hour trying to write a
          | method that deep copies the list, makes the change and then
          | returns the copy, running into all sorts of problems until I
          | asked AI to build me a safe list mutation method
         | and it was like  "Sure, this is how I'd do it but also the API
         | you're working with already has a method that just....does
         | this." Cases like this are where AI is supremely useful -
         | intricate but well-defined problems.
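          | 
          | For the curious, a rough sketch of that pitfall and the
          | built-in fix (guessing the method the AI pointed at was
          | Collection.removeIf):
          | 
          |   import java.util.ArrayList;
          |   import java.util.List;
          |   import java.util.stream.Collectors;
          | 
          |   public class RemoveWhileIterating {
          |     public static void main(String[] args) {
          |       List<Integer> xs =
          |           new ArrayList<>(List.of(1, 2, 3, 4, 5, 6));
          | 
          |       // Throws ConcurrentModificationException: the list is
          |       // structurally modified while the for-each iterator is
          |       // still walking it.
          |       // for (Integer x : xs) {
          |       //   if (x % 2 == 0) xs.remove(x);
          |       // }
          | 
          |       // The built-in that "just does this":
          |       xs.removeIf(x -> x % 2 == 0);
          | 
          |       // Or stream into a new list instead of mutating:
          |       List<Integer> odds = xs.stream()
          |           .filter(x -> x % 2 != 0)
          |           .collect(Collectors.toList());
          | 
          |       System.out.println(xs);   // [1, 3, 5]
          |       System.out.println(odds); // [1, 3, 5]
          |     }
          |   }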
        
           | cwmoore wrote:
           | Code reuse at scale: 80 + 80 = 160% ~ phi...coincidence?
           | 
           | I think this may become a long horizon harvest for the
           | rigorous OOP strategy, may Bill Joy be disproved.
           | 
           | Gray goo may not [taste] like steel-cut oatmeal.
        
             | Sharlin wrote:
              | It's often said that _π_ is the factor by which one should
              | multiply all estimates - reducing it to _φ_ would be a
             | significant improvement in estimation accuracy!
        
             | visarga wrote:
             | 1.6x multiplier is low, we usually need to apply 5x
        
         | 01100011 wrote:
         | As an old dev this is really all I want: a sort of autocorrect
         | for my syntactical errors to save me a couple compile-edit
         | cycles.
        
           | pferde wrote:
           | What I want is not autocorrect, because that won't teach me
           | anything. I want it to yell at me loudly and point to the
           | syntactical error.
           | 
           | Autocorrect is a scourge of humanity.
        
         | causal wrote:
         | Agreed and +1 on "always feels like it is almost there" leading
         | to time sink. AI is especially good at making you feel like
         | it's doing something useful; it takes a lot of skill to discern
         | the truth.
        
         | i_love_retros wrote:
         | The problem is I then have to also figure out the code it wrote
         | to be able to complete the final 20%. I have no momentum and am
         | starting from almost scratch mentally.
        
       | fritzo wrote:
       | As an open source maintainer on the brink of tech debt
       | bankruptcy, I feel like AI is a savior, helping me keep up with
       | rapid changes to dependencies, build systems, release
       | methodology, and idioms.
        
         | aerhardt wrote:
         | But what about producing actual code?
        
           | fritzo wrote:
           | Producing code is overrated. There's lots of old code whose
           | lifetime we can extend.
        
             | fhd2 wrote:
             | Very, very much this.
        
           | resource_waste wrote:
           | I find it useful for simple algorithms and error solving.
        
         | candiddevmike wrote:
         | If you stewarded that much tech debt in the first place, how
          | can you be sure an LLM will help prevent it going forward? In
          | my experience, LLMs add more tech debt due to lacking cohesion
          | with their edits.
        
       | IshKebab wrote:
       | I wonder if the discrepancy is that it felt like it was taking
       | less time because they were having to do less thinking which
       | feels like it is easier and hence faster.
       | 
       | Even so... I still would be really surprised if there wasn't some
       | systematic error here skewing the results, like the developers
       | deliberately picked "easy" tasks that they already knew how to
       | do, so implementing them themselves was particularly fast.
       | 
        | Seems like the authors had about as good a methodology as you
        | can get for something like this. It's just really hard to test
        | stuff
       | like this. I've seen studies proving that code comments don't
       | matter for example... are you going to stop writing comments? No.
        
         | narush wrote:
         | > which feels like it is easier and hence faster.
         | 
         | We explore this factor in section (C.2.5) - "Trading speed for
         | ease" - in the paper [1]. It's labeled as a factor with an
         | unclear effect, some developers seem to think so, and others
         | don't!
         | 
         | > like the developers deliberately picked "easy" tasks that
         | they already knew how to do
         | 
         | We explore this factor in (C.2.2) - "Unrepresentative task
         | distribution." I think the effect here is unclear; these are
         | certainly real tasks, but they are sampled from the smaller end
         | of tasks developers would work on. I think the relative effect
         | on AI vs. human performance is not super clear...
         | 
         | [1]
         | https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
        
           | IshKebab wrote:
           | Sounds like you've thought of everything!
        
       | tcdent wrote:
       | This study neglects to incorporate the fact that I have forgotten
       | how to write code.
        
         | resource_waste wrote:
         | I'm curious what space people are working in where AI does
         | their job entirely.
         | 
         | I can use it for parts of code, algorithms, error solving, and
         | maybe sometimes a 'first draft'.
         | 
         | But there is no way I could finish an entire piece of software
         | with AI only.
        
           | asdff wrote:
           | Not a lot of people are empowered to create an entire piece
           | of software. Most are probably in the trenches squashing
           | tickets.
        
             | tcdent wrote:
             | I do create entire pieces of software, and while my
             | workflow is always evolving, it goes something like this:
             | 
             | Define schemas, interfaces, and perhaps some base classes
             | that define the attributes I'm thinking about.
             | 
             | Research libraries that support my cause, and include them.
             | 
             | Reference patterns I have established in other parts of the
             | codebase; internal tooling for database, HTTP services,
             | etc.
             | 
             | Instruct the agent to come up with a plan for a first pass
             | at execution in markdown format. Iterate on this plan;
             | "what about X?"
             | 
             | Splat a bunch of code down that supports the structure I'm
             | looking for. Iterate. Cleanup. Iterate. Implement unit
             | tests and get them to pass.
             | 
             | Go back through everything manually and adjust it to suit
             | my personal style, while at the same time fully
             | understanding what's being done and why.
             | 
             | I use STT a lot to have conversations with the agent as we
             | go, and very rarely allow it to make sequential edits
             | without reviewing first; this is a great opportunity to go
             | back and forth and refine what's being written.
        
               | asdff wrote:
               | You are going well above and beyond what a lot of people
               | do to be fair. There are people in senior roles who are
               | just futzing with json files.
        
             | joks wrote:
             | I think the question still stands.
        
         | narush wrote:
          | Honestly, this is a fair point -- and speaks to the difficulty
          | of figuring out the right baseline to measure against here!
         | 
         | If we studied folks with _no_ AI experience, then we might
         | underestimate speedup, as these folks are learning tools (see a
         | discussion of learning effects in section (C.2.7) - Below-
         | average use of AI tools - in the paper). If we studied folks
         | with _only_ AI experience, then we might overestimate speedup,
         | as perhaps these folks can't really program without AI at all.
         | 
         | In some sense, these are just two separate and interesting
         | questions - I'm excited for future work to really dig in on
         | both!
        
       | NewsaHackO wrote:
        | So they paid developers 300 x 246 = about 73K just for developer
        | recruitment for a study that is not in any academic journal and
        | has no peer review? The underlying paper _looks_ quite polished
        | and not overtly AI generated, so I don't want to say it's
        | entirely made up, but how were they even able to get funding for
        | this?
        
         | iLoveOncall wrote:
         | https://metr.org/about Seems like they get paid by AI
         | companies, and they also get government funding.
        
         | narush wrote:
         | Our largest funding was through The Audacious Project -- you
         | can see an announcement here:
         | https://metr.org/blog/2024-10-09-new-support-through-the-aud...
         | 
         | Per our website, "To date, April 2025, we have not accepted
         | compensation from AI companies for the evaluations we have
         | conducted." You can check out the footnote on this page:
         | https://metr.org/donate
        
           | iLoveOncall wrote:
           | This is really disingenuous when you also say that OpenAI and
           | Anthropic have provided you with access and compute credits
           | (on https://metr.org/about).
           | 
           | Not all payment is cash. Compute credits is still by all
           | means compensation.
        
             | gtsop wrote:
             | Are you willing to be compensated with compute credits for
             | your job?
             | 
             | Such companies spit out "credits" all over the place in
                | order to gain traction and establish themselves. I
                | remember when cloud providers gave VPS credits to
                | startups like they were peanuts. To me, it really means
                | absolutely nothing.
        
               | bawolff wrote:
               | I wouldn't do my job for $10, but if somehow someone did
                | pay me $10 to do something, I wouldn't claim I wasn't
               | compensated.
               | 
               | In-kind compensation is still compensation.
        
               | iLoveOncall wrote:
               | > Are you willing to be compensated with compute credits
               | for your job?
               | 
               | Well, yes? I use compute for some personal projects so I
               | would be absolutely fine if a part of my compensation was
               | in compute credits.
               | 
               | As a company, even more so.
        
             | dolebirchwood wrote:
             | Is it "really" disingenuous, or is it just a
             | misinterpretation of what it means to be "compensated for"?
             | Seems more like quibbling to me.
        
               | iLoveOncall wrote:
               | I was actually being kind by saying it's disingenuous. I
               | think it's an outright lie.
        
             | golly_ned wrote:
             | Those are compute credits that are directly spent on the
             | experiment itself. It's no more "compensation" than a
             | chemistry researcher being "compensated" with test tubes.
        
               | iLoveOncall wrote:
               | > Those are compute credits that are directly spent on
               | the experiment itself.
               | 
               | You're extrapolating, it's not saying this anywhere.
               | 
               | > It's no more "compensation" than a chemistry researcher
               | being "compensated" with test tubes.
               | 
               | Yes, that's compensation too. Thanks for contributing
               | another example. Here's another one: it's no more
               | compensation than a software engineer being compensated
               | with a new computer.
               | 
               | Actually the situation here is way worse than your
               | example. Unless the chemistry researcher is commissioned
               | by Big Test Tube Corp. to conduct research on the outcome
               | of using their test tubes, there's no conflict of
               | interest here. But there is an obvious conflict of
               | interest on AI research being financed by credits given
               | by AI companies to use their own AI tools.
        
         | bee_rider wrote:
         | Companies produce whitepapers all the time, right? They are
         | typically some combination of technical report, policy
         | suggestion, and advertisement for the organization.
        
         | fabianhjr wrote:
         | Most of the world provides funding for research, the US used to
         | provide funding but now that has been mostly gutted.
        
         | resource_waste wrote:
         | >which is not in any academic journal, or has no peer reviews?
         | 
         | As a philosopher who is into epistemology and ontology, I find
         | this to be as abhorrent as religion.
         | 
          | With 'science', it doesn't matter who publishes it. Science
          | needs to be replicated.
         | 
         | The psychology replication crisis is a prime example of why
         | peer reviews and publishing in a journal matters 0.
        
           | bee_rider wrote:
           | > The psychology replication crisis is a prime example of why
           | peer reviews and publishing in a journal matters 0.
           | 
           | Specifically, it works as an example of a specific case where
           | peer review doesn't help as much. Peer review checks your
           | arguments, not your data collection process (which the
           | reviewer can't audit for obvious reasons). It works fine in
           | other scenarios.
           | 
           | Peer review is unrelated to replication problems, except to
           | the extent to which confused people expect peer review to fix
           | totally unrelated replication problems.
        
           | raincole wrote:
           | Peer reviews are very important to filter out obviously low
           | effort stuff.
           | 
           | ...Or should I say "were" very important? With the help of
            | today's GenAI, any low-effort stuff can look high-effort
           | without much extra effort.
        
       | 30minAdayHN wrote:
       | This study focused on experienced OSS maintainers. Here is my
       | personal experience, but a very different persona (or opposite to
       | the one in the study). I always wanted to contribute to OSS but
       | never had time to. Finally was able to do that, thanks to AI.
       | Last month, I was able to contribute to 4 different repositories
        | which I would never have dreamed of doing. I was using an
       | async coding agent I built[1], to generate PRs given a GitHub
       | issue. Some PRs took a lot of back and forth. And some PRs were
       | accepted as is. Without AI, there is no way I would have
       | contributed to those repositories.
       | 
        | One thing that did work in my favor is that I was clearly
        | creating a failing repro test case and adding before and after
        | results along with the PR. That helped get the PR landed.
       | 
       | There are also a few PRs that never got accepted because the
        | repro wasn't as strong or clear.
       | 
       | [1] https://workback.ai
        
       | MYEUHD wrote:
       | This does not mention the open-source developer time wasted while
       | reviewing vibe coded PRs
        
         | narush wrote:
         | Yeah, I'll note that this study does _not_ capture the entire
         | OS dev workflow -- you're totally right that reviewing PRs is a
         | big portion of the time that many maintainers spend on their
         | projects (and thanks to them for doing this [often hard] work).
         | In the paper [1], we explore this factor in more detail -- see
         | section (C.2.2) - Unrepresentative task distribution.
         | 
         | There's some existing lit about increased contributions to OS
         | repositories after the introduction of AI -- I've also
          | personally heard a few anecdotes about an increase in the
          | number of low-quality PRs from first-time contributors,
         | seemingly as a result of AI making it easier to get started --
         | ofc, the tradeoff is that making it easier to get started has
         | pros to it too!
         | 
         | [1]
         | https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
        
       | castratikron wrote:
        | I really like those graphics; does anyone know what tool was
        | used to create them?
        
         | narush wrote:
         | The graphs are all matplotlib. The methodology figure is built
         | in Figma! (Source: I'm a paper author :)).
        
       | narush wrote:
       | Hey HN, study author here. I'm a long-time HN user -- and I'll be
       | in the comments today to answer questions/comments when possible!
       | 
       | If you're short on time, I'd recommend just reading the linked
       | blogpost or the announcement thread here [1], rather than the
       | full paper.
       | 
       | [1] https://x.com/METR_Evals/status/1943360399220388093
        
         | jsnider3 wrote:
         | It's good to know that Claude 3.7 isn't enough to build Skynet!
        
         | causal wrote:
         | Hey I just wanted to say this is one of the better studies I've
         | seen - not clickbaity, very forthright about what is being
         | claimed, and presented in such an easy-to-digest format. Thanks
         | so much for doing this.
        
           | narush wrote:
           | Thanks for the kind words!
        
         | igorkraw wrote:
         | Could you either release the dataset (raw but anonymized) for
         | independent statistical evaluation or at least add the absolute
         | times of each dev per task to the paper? I'm curious what the
         | absolute times of each dev with/without AI was and whether the
         | one guy with lots of Cursor experience was actually faster than
          | the rest or just a slow typer getting a big boost out of LLMs.
         | 
         | Also, cool work, very happy to see actually good evaluations
          | instead of just vibes or observational studies that don't
         | account for the Hawthorne effect
        
           | narush wrote:
           | Yep, sorry, meant to post this somewhere but forgot in final-
           | paper-polishing-sprint yesterday!
           | 
           | We'll be releasing anonymized data and some basic analysis
           | code to replicate core results within the next few weeks
            | (probably next week, depending).
           | 
           | Our GitHub is here (http://github.com/METR/) -- or you can
           | follow us (https://x.com/metr_evals) and we'll probably tweet
           | about it.
        
             | igorkraw wrote:
              | Cool, thanks a lot. Btw, I have a very tiny tiny (50 to 100
              | audience) podcast where we try to give context to what we
              | call the "muck" of AI discourse (trying to ground claims in
              | both what we would call objectively observable
              | facts/evidence, and then _separately_ giving our own biased
              | takes). If you would be interested to come on it and chat
              | => contact email in my profile.
        
               | ryanar wrote:
               | podcast link?
        
         | antonvs wrote:
         | Was any attention paid to whether the tickets being implemented
         | with AI assistance were an appropriate use case for AI?
         | 
         | If the instruction is just "implement this ticket with AI",
         | then that's very realistic in that it's how management often
         | tries to operate, but it's also likely to be quite suboptimal.
         | There are ways to use AI that help a lot, and other ways that
          | hurt more than they help.
         | 
         | If your developers had sufficient experience with AI to tell
         | the difference, then they might have compensated for that, but
         | reading the paper I didn't see any indication of that.
        
           | narush wrote:
            | The instructions given to developers were not just "implement
            | with AI" - but rather that they could use AI if they deemed
            | it would be helpful, and indeed did _not need to use AI if
            | they didn't think it would be helpful_. In about ~16% of
            | labeled screen recordings where developers were allowed to
            | use AI, they chose to use no AI at all!
           | 
           | That being said, we can't rule out that the experiment drove
           | them to use more AI than they would have outside of the
           | experiment (in a way that made them less productive). You can
           | see more in section "Experimentally driven overuse of AI
           | (C.2.1)" [1]
           | 
           | [1]
           | https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
        
         | isoprophlex wrote:
         | I'll just say that the methodology of the paper and the
         | professionalism with which you are answering us here is top
         | notch. Great work.
        
           | narush wrote:
           | Thank you!
        
         | JackC wrote:
         | (I read the post but not paper.)
         | 
         | Did you measure subjective fatigue as one way to explain the
         | misperception that AI was faster? As a developer-turned-manager
         | I like AI because it's easier when my brain is tired.
        
           | narush wrote:
           | We attempted to! We explore this more in the section Trading
            | speed for ease (C.2.5) in the paper
            | (https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf).
           | 
            | TLDR: mixed evidence from quantitative and qualitative
            | reports that developers find it less effortful. Unclear
            | effect.
        
       | incomingpain wrote:
       | Essentially an advertisement against Cursor Pro and/or Claude
       | Sonnet 3.5/3.7
       | 
        | I think personally, when I tried tools like Void IDE, I was
       | fighting with Void too much. It is beta software, it is buggy,
       | but also the big one... learning curve of the tool.
       | 
        | I haven't had the chance to try Cursor, but I imagine it's going
        | to have a learning curve as a new tool.
       | 
        | So perhaps a slowdown at first is expected; but later, after you
        | get your context and prompting down pat and ask specifically for
        | what you want, you get your speedup.
        
       | achenet wrote:
       | I find agents useful for showing me how to do something I don't
       | already know how to do, but I could see how for tasks I'm an
       | expert on, I'd be faster without an extra thing to have to worry
       | about (the AI).
        
       | dboreham wrote:
       | Any time you see the word "measuring" in the context of software
       | development, you know what follows will be nonsense and probably
       | in service of someone's business model.
        
       | simonw wrote:
       | Here's the full paper, which has a lot of details missing from
       | the summary linked above:
       | https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
       | 
       | My personal theory is that getting a significant productivity
       | boost from LLM assistance and AI tools has a much steeper
       | learning curve than most people expect.
       | 
       | This study had 16 participants, with a mix of previous exposure
       | to AI tools - 56% of them had never used Cursor before, and the
       | study was mainly about Cursor.
       | 
       | They then had those 16 participants work on issues (about 15
       | each), where each issue was randomly assigned a "you can use AI"
        | vs. "you can't use AI" rule.
       | 
       | So each developer worked on a mix of AI-tasks and no-AI-tasks
       | during the study.
       | 
       | A quarter of the participants saw increased performance, 3/4 saw
       | reduced performance.
       | 
       | One of the top performers for AI was also someone with the most
       | previous Cursor experience. The paper acknowledges that here:
       | 
       | > However, we see positive speedup for the one developer who has
       | more than 50 hours of Cursor experience, so it's plausible that
       | there is a high skill ceiling for using Cursor, such that
       | developers with significant experience see positive speedup.
       | 
       | My intuition here is that this study mainly demonstrated that the
       | learning curve on AI-assisted development is high enough that
       | asking developers to bake it into their existing workflows
        | reduces their performance while they climb that learning curve.
        
         | mjr00 wrote:
         | > My intuition here is that this study mainly demonstrated that
         | the learning curve on AI-assisted development is high enough
         | that asking developers to bake it into their existing workflows
          | reduces their performance while they climb that learning curve.
         | 
         | Definitely. Effective LLM usage is not as straightforward as
         | people believe. Two big things I see a lot of developers do
         | when they share chats:
         | 
         | 1. Talk to the LLM like a human. Remember when internet search
         | first came out, and people were literally "Asking Jeeves" in
         | full natural language? Eventually people learned that you don't
         | need to type, "What is the current weather in San Francisco?"
         | because "san francisco weather" gave you the same, or better,
         | results. Now we've come full circle and people talk to LLMs
         | like humans again; not out of any advanced prompt engineering,
         | but just because it's so anthropomorphized it feels natural.
         | But I can assure you that "pandas count unique values column
         | 'Foo'" is just as effective an LLM prompt as "Using pandas, how
         | do I get the count of unique values in the column named 'Foo'?"
         | The LLM is also not insulted by you talking to it like this.
         | 
         | 2. Don't know when to stop using the LLM. Rather than let the
         | LLM take you 80% of the way there and then handle the remaining
         | 20% "manually", they'll keep trying to prompt to get the LLM to
         | generate what they want. Sometimes this works, but often it's
         | just a waste of time and it's far more efficient to just take
         | the LLM output and adjust it manually.
         | 
         | Much like so-called Google-fu, LLM usage is a skill and people
         | who don't know what they're doing are going to get substandard
         | results.
        
           | Jaxan wrote:
           | > Effective LLM usage is not as straightforward as people
           | believe
           | 
           | It is not as straightforward as people are told to believe!
        
             | sleepybrett wrote:
             | ^ this, so much this. The amount of bullshit that gets
             | shoveled into hacker news threads about the supposed
             | capabilities of these models is epic.
        
           | gedy wrote:
           | > Talk to the LLM like a human
           | 
           | Maybe the LLM doesn't strictly need it, but typing out does
           | bring some clarity for the asker. I've found it helps a lot
           | to catch myself - what am I even wanting from this?
        
           | frotaur wrote:
           | I'm not sure about your example about talking to LLMs. There
           | is good reason to think that speaking to it like a human
           | might produce better results, as that's what most of the
           | training data is composed of.
           | 
            | I don't have any studies, but it seems to me reasonable to
           | assume.
           | 
           | (Unlike google, where presumably it actually used keywords
           | anyway)
        
             | mjr00 wrote:
             | > I'm not sure about your example about talking to LLMs.
             | There is good reason to think that speaking to it like a
             | human might produce better results, as that's what most of
             | the training data is composed of.
             | 
             | In practice I have not had any issues getting information
             | out of an LLM when speaking to them like a computer, rather
             | than a human. At least not for factual or code-related
             | information; I'm not sure how it impacts responses for e.g.
             | creative writing, but that's not what I'm using them for
             | anyway.
        
           | lukan wrote:
           | "But I can assure you that "pandas count unique values column
           | 'Foo'" is just as effective an LLM prompt as "Using pandas,
           | how do I get the count of unique values in the column named
           | 'Foo'?""
           | 
           | How can you be so sure? Did you compare in a systematic way
           | or read papers by people who did it?
           | 
            | Now I surely get results giving the LLM only snippets and
            | keywords, but for anything complex, I do notice differences
            | in the way I articulate. Not claiming there _is_ a
            | significant difference, but it seems that way to me.
        
             | mjr00 wrote:
             | > How can you be so sure? Did you compare in a systematic
             | way or read papers by people who did it?
             | 
              | No, but I didn't need to read scientific papers to figure
              | out how to use Google effectively, either. I'm just using a
             | results-based analysis after a lot of LLM usage.
        
               | lukan wrote:
                | Well, I did need some tutorials to use Google
                | efficiently in the old days when + meant something
               | specific.
        
               | skybrian wrote:
                | Other people don't have the benefit of your experience,
               | though, so there's a communications gap here: this boils
               | down to "trust me, bro."
               | 
               | How do we get beyond that?
        
               | mjr00 wrote:
               | This is the gap between capability (what can this tool
               | do?) versus workflow (what is the best way to use this
               | tool to accomplish a goal?). Capabilities can be strictly
               | evaluated, but workflow is subjective. Saying "Google has
               | the site: and before: operators" is capability, saying
               | "you should use site:reddit.com before:2020 in Google
               | queries" is workflow.
               | 
               | LLMs have made the distinction ambiguous because their
               | capabilities are so poorly understood. When I say "you
               | should talk to an LLM like it's a computer", that's a
               | workflow statement; it's a more efficient way to
               | accomplish the same goal. You can try it for yourself and
               | see if you agree. I personally liken people who talk to
               | LLMs in full, proper English, capitalization and all, to
               | boomers who still type in full sentences when running a
               | Google query. Is there anything _strictly_ wrong with it?
                | Not really. Do I believe it's a more efficient workflow
               | to just type the keywords that will give you the same
               | result? Yes.
               | 
               | Workflow efficiencies can't really be scientifically
               | evaluated. Some people still prefer to have desktop icons
               | for programs on Windows; my workflow is pressing winkey
               | -> typing the first few characters of the program ->
               | enter. Is one of these methods scientifically more
               | correct? Not really.
               | 
               | So, yeah -- eventually you'll either find your own
               | workflow or copy the workflow of someone you see who is
               | using LLMs effectively. It really _is_ "just trust me,
               | bro."
        
               | skybrian wrote:
               | Maybe it would help if more people wrote tutorials? It
               | doesn't seem reasonable for people who don't have a buddy
               | to learn from to have to figure it out on their own.
        
           | bit1993 wrote:
           | > Rather than let the LLM take you 80% of the way there and
           | then handle the remaining 20% "manually"
           | 
           | IMO 80% is way too much, LLMs are probably good for things
            | that are not your domain knowledge and you can afford to not
           | be 100% correct, like rendering the Mandelbrot set, simple
           | functions like that.
           | 
           | LLMs are not deterministic sometimes they produce correct
           | code and other times they produce wrong code. This means one
           | has to audit LLM generated code and auditing code takes more
           | effort than writing it, especially if you are not the
           | original author of the code being audited.
           | 
           | Code has to be 100% deterministic. As programmers we write
           | code, detailed instructions for the computer (CPU), we have
            | developed a lot of tools such as unit tests to make sure the
           | computer does exactly what we wrote.
           | 
            | A codebase has a lot of context that you gain by writing the
            | code, some things just look wrong and you know exactly why
            | because you wrote the code, there is also a lot of context
           | that you should keep in your head as you write the code,
           | context that you miss from simply prompting an LLM.
        
         | narush wrote:
         | Hey Simon -- thanks for the detailed read of the paper - I'm a
         | big fan of your OS projects!
         | 
         | Noting a few important points here:
         | 
         | 1. Some prior studies that find speedup do so with developers
         | that have similar (or less!) experience with the tools they
         | use. In other words, the "steep learning curve" theory doesn't
         | differentially explain our results vs. other results.
         | 
         | 2. Prior to the study, 90+% of developers had reasonable
          | experience prompting LLMs. Before we found slowdown, the only
          | concern that most external reviewers had about experience was
          | about prompting -- as prompting was considered the primary
          | skill. In general, the standard wisdom was/is
         | Cursor is very easy to pick up if you're used to VSCode, which
         | most developers used prior to the study.
         | 
         | 3. Imagine all these developers had a TON of AI experience. One
         | thing this might do is make them worse programmers when not
         | using AI (relatable, at least for me), which in turn would
          | raise the speedup we find (not because AI was better, but
          | because the without-AI baseline was worse). In other words,
          | we're
         | sorta in between a rock and a hard place here -- it's just
         | plain hard to figure out what the right baseline should be!
         | 
         | 4. We shared information on developer prior experience with
         | expert forecasters. Even with this information, forecasters
         | were still dramatically over-optimistic about speedup.
         | 
         | 5. As you say, it's totally possible that there is a long-tail
         | of skills to using these tools -- things you only pick up and
         | realize after hundreds of hours of usage. Our study doesn't
         | really speak to this. I'd be excited for future literature to
         | explore this more.
         | 
         | In general, these results being surprising makes it easy to
         | read the paper, find one factor that resonates, and conclude
         | "ah, this one factor probably just explains slowdown." My
         | guess: there is no one factor -- there's a bunch of factors
         | that contribute to this result -- at least 5 seem likely, and
         | at least 9 we can't rule out (see the factors table on page
         | 11).
         | 
         | I'll also note that one really important takeaway -- that
         | developer self-reports after using AI are overoptimistic to the
         | point of being on the wrong side of speedup/slowdown -- isn't a
         | function of which tool they use. The need for robust, on-the-
         | ground measurements to accurately judge productivity gains is a
         | key takeaway here for me!
         | 
         | (You can see a lot more detail in section C.2.7 of the paper
         | ("Below-average use of AI tools") -- where we explore the
         | points here in more detail.)
        
           | simonw wrote:
           | Thanks for the detailed reply! I need to spend a bunch more
           | time with this I think - above was initial hunches from
           | skimming the paper.
        
             | narush wrote:
             | Sounds great. Looking forward to hearing more detailed
             | thoughts -- my emails in the paper :)
        
           | paulmist wrote:
           | Were participants given time to customize their Cursor
           | settings? In my experience tool/convention mismatch kills
           | Cursor's productivity - once it gets going with a wrong
           | library or doesn't use project's functions I will almost
           | always reject code and re-prompt. But, especially for large
           | projects, having a well-crafted repo prompt mitigates most of
           | these issues.
        
           | jdp23 wrote:
           | Really interesting paper, and thanks for the followon points.
           | 
           | The over-optimism is indeed a really important takeaway, and
           | agreed that it's not tool-dependent.
        
           | gojomo wrote:
           | Did each developer do a large enough mix of AI/non-AI tasks,
           | in varying orders, that you have any hints in your data
           | whether the "AI penalty" grew or shrunk over time?
        
             | narush wrote:
             | You can see this analysis in the factor analysis of "Below-
             | average use of AI tools" (C.2.7) in the paper [1], which we
             | mark as an unclear effect.
             | 
             | TLDR: over the first 8 issues, developers do not appear to
             | get majorly less slowed down.
             | 
              | [1]
              | https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
        
               | gojomo wrote:
               | Thanks, that's great!
               | 
               | But: if all developers did 136 AI-assisted issues, why
               | only analyze excluding the 1st 8, rather than, say, the
               | first 68 (half)?
        
               | narush wrote:
               | Sorry, this is the first 8 issues per-developer!
        
           | amirhirsch wrote:
           | Figure 6, which breaks down the time spent on different
           | tasks, is very informative -- it suggests:
           | 
           | 15% less active coding, 5% less testing, 8% less research
           | and reading;
           | 
           | 4% more idle time, 20% more AI interaction time.
           | 
           | The 28% less coding/testing/research is why developers
           | reported 20% less work. You might be spending 20% more time
           | overall "working" while really being idle 5% more of the
           | time, and feel like you've worked less because you were
           | drinking coffee and eating a sandwich between waiting for
           | the AI and reading AI output.
           | 
           | I think the AI skill-boost comes from having workflows that
           | let you shave off half that git-ops time and cut an extra 5%
           | off coding; if you also cut the idle/waiting and do more
           | prompting of parallel agents and a bit more testing, then
           | you really are a 2x dev.
        
             | amirhirsch wrote:
             | I just realized the figure is showing the time breakdown
             | as a percentage of total time. It would be more useful to
             | show absolute time (hours) for those side-by-side
             | comparisons, since the implied hours would boost the AI
             | bars' height by 18%.
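             | 
             | To make that conversion concrete, a rough sketch (the 18%
             | figure is my inference above, not a number from the
             | paper's raw data, and the shares are whatever you read off
             | Figure 6):
             | 
             |   # scale a percent-of-total share by each condition's
             |   # total wall-clock time so the two bars are comparable
             |   SLOWDOWN = 1.18  # assumed: AI issues take ~18% longer
             | 
             |   def to_relative_time(share_of_total, with_ai):
             |       return share_of_total * (SLOWDOWN if with_ai else 1.0)
             | 
             |   # e.g. a 20% AI-interaction share of the (longer) AI
             |   # total is ~23.6% of the no-AI total's wall-clock time
             |   print(to_relative_time(0.20, with_ai=True))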
        
               | narush wrote:
               | There's additional breakdown per-minute in the appendix
               | -- see appendix E.4!
        
         | smokel wrote:
         | I notice that some people have become more productive thanks to
         | AI tools, while others have not.
         | 
         | My working hypothesis is that people who are fast at scanning
         | lots of text (or code for that matter) have a serious
         | advantage. Being able to dismiss unhelpful suggestions quickly
         | and then iterating to get to helpful assistance is key.
         | 
         | Being fast at scanning code correlates with seniority, but
         | there are also senior developers who can write at a solid pace,
         | but prefer to take their time to read and understand code
         | thoroughly. I wouldn't be surprised if this kind of developer
         | gains little profit from typical AI coding assistance. There
         | are also
         | juniors who can quickly read text, and possibly these have an
         | advantage.
         | 
         | A similar effect has been around with being able to quickly
         | "Google" something. I wouldn't be surprised if this is the same
         | trait at work.
        
           | luxpir wrote:
           | Just to thank you for that point. I think it's likely more
           | true than most of us realise. That and maybe the ability to
           | mentally scaffold or outline a system or solution ahead of
           | time.
        
           | Filligree wrote:
           | An interesting point. I wonder how much my decades-old habit
           | of watching subtitled anime helps there--it's definitely made
           | me dramatically faster at scanning text.
        
         | onlyrealcuzzo wrote:
         | How were "experienced engineers" defined?
         | 
         | I've found AI to be quite helpful in pointing me in the right
         | direction when navigating an entirely new code-base.
         | 
         | When it's code I already know like the back of my hand, it's
         | not super helpful, other than maybe doing a few automated tasks
         | like refactoring, where there have already been some good tools
         | for a while.
        
           | smj-edison wrote:
           | > To directly measure the real-world impact of AI tools on
           | software development, we recruited 16 experienced developers
           | from large open-source repositories (averaging 22k+ stars and
           | 1M+ lines of code) that they've contributed to for multiple
           | years.
        
         | furyofantares wrote:
         | > My personal theory is that getting a significant productivity
         | boost from LLM assistance and AI tools has a much steeper
         | learning curve than most people expect.
         | 
         | I totally agree with this. Although also, you can end up in a
         | bad spot even after you've gotten pretty good at getting the AI
         | tools to give you good output, because you fail to learn the
         | code you're producing well.
         | 
         | A developer gets better at the code they're working on over
         | time. An LLM gets worse.
         | 
         | You can use an LLM to write a lot of code fast, but if you
         | don't pay enough attention, you aren't getting any better at
         | the code while the LLM is getting worse. This is why you can
         | get like two months of greenfield work done in a weekend but
         | then hit a brick wall - you didn't learn anything about the
         | code that was written, and while the LLM started out producing
         | reasonable code, it got worse until you have a ball of mud that
         | neither the LLM nor you can effectively work on.
         | 
         | So a really difficult skill in my mind is continually avoiding
         | temptation to vibe. Take a whole week to do a month's worth of
         | features, not a weekend to do two months' worth, and put in the
         | effort to guide the LLM to keep producing clean code, and to be
         | sure you know the code. You do want to know the code and you
         | can't do that without putting in work yourself.
        
           | danieldk wrote:
           | _So a really difficult skill in my mind is continually
           | avoiding temptation to vibe._
           | 
           | I agree. I have found that I can use agents most effectively
           | by letting them write code in small steps. After each step I
           | review the changes and polish them up (either by doing the
           | fixups myself or by prompting). I have found that this helps
           | me understand the code, but it also keeps the model from
           | getting into a bad solution space or producing
           | unmaintainable code.
           | 
           | I also think this kind of closed loop is necessary. Like
           | yesterday I let an LLM write a relatively complex data
           | structure. It got the implementation nearly correct, but was
           | stuck, unable to find an off-by-one error in a comparison.
           | In this case it was easy to catch because I let it write
           | property-based tests (which I had to fix up to work
           | properly), but it's easy for things to slip through the
           | cracks if you don't review carefully.
           | 
           | (This is all using Cursor + Claude 4.)
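           | 
           | To make the property-based check concrete, here's a minimal
           | sketch of the kind of test I mean (using the Hypothesis
           | library, with a simple sorted-insert standing in for the
           | actual data structure from that session):
           | 
           |   from hypothesis import given, strategies as st
           |   import bisect
           | 
           |   def insert_sorted(xs, x):
           |       # return a new sorted list with x inserted
           |       i = bisect.bisect_left(xs, x)
           |       return xs[:i] + [x] + xs[i:]
           | 
           |   @given(st.lists(st.integers()), st.integers())
           |   def test_insert_keeps_order(xs, x):
           |       xs = sorted(xs)
           |       # property: inserting x matches sorting after appending
           |       assert insert_sorted(xs, x) == sorted(xs + [x])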
        
           | bluefirebrand wrote:
           | > Take a whole week to do a month's worth of features
           | 
           | Everything else in your post is so reasonable and then you
           | still somehow ended up suggesting that LLMs should be
           | quadrupling our output
        
             | furyofantares wrote:
             | I'm specifically talking about greenfield work. I do a lot
             | of game prototypes, it definitely does that at the very
             | beginning.
        
               | bluefirebrand wrote:
               | Greenfield is still such a tiny percentage of all
               | software work going on in the world though :/
        
               | furyofantares wrote:
               | I agree, that's fair. I think a lot of people are playing
               | around with AI on side projects and making some bad
               | extrapolations from their initial experiences.
               | 
               | It'll also apply to isolated-enough features, which is
               | still a small amount of someone's work (not often
               | something you'd work on for a full month straight), but
               | more people will have experience with this.
        
               | lurking_swe wrote:
               | greenfield development is also the "easiest" and most fun
               | part of software development. As the famous saying goes,
               | the last 10% of the project takes 90% of the time lol.
               | 
               | I've also noticed that, generally, nobody likes
               | maintaining old systems.
               | 
               | so where does this leave us as software engineers? Should
               | I be excited that it's easy to spin up a bunch of code
               | that I don't deeply understand at the beginning of my
               | project, while removing the fun parts of the project?
               | 
               | I'm still grappling with what this means for our industry
               | in 5-10 years...
        
               | Filligree wrote:
               | It's a tiny percentage of software work because the
               | programming is slow, and setting up new projects is even
               | slower.
               | 
               | It's been a majority of my projects for the past two
               | months. Not because work changed, but because I've
               | written a dozen tiny, personalised tools that I wouldn't
               | have written at all if I didn't have Claude to do it.
               | 
               | Most of them were completed in less than an hour, to give
               | you an idea of the size. Though it would have easily been
               | a day on my own.
        
               | Dzugaru wrote:
               | This is really interesting, because I do gamejams from
               | time to time - and I try every time to make it work, but
               | I'm still quite a lot faster doing stuff myself.
               | 
               | This is visible under the extreme time pressure of
               | producing a working game in 72 hours (our team
               | consistently scores top 100 in Ludum Dare, which is a
               | somewhat high standard).
               | 
               | We use the popular Unity game engine, which all LLMs
               | have a wealth of experience with (as with game
               | development in general), but the output is so strangely
               | "almost correct but not usable" 80% of the time that I
               | cannot take the luxury of letting it figure things out,
               | and I use it as a fancy autocomplete. And I also still
               | check docs and Stack Overflow-style forums a lot,
               | because of stuff it plainly makes up.
               | 
               | One of the reasons is maybe that our game mechanics are
               | often a bit off the beaten path, though the last game we
               | made was literally a platformer with rope physics (the
               | LLM could not produce a good idea for how to make stable
               | and simple rope physics, under our constraints, codeable
               | in 3 hours).
        
           | WD-42 wrote:
           | I feel the same way. I use it for super small chunks, still
           | understand everything it outputs, and often manually
           | copy/paste or straight up write it myself. I don't know if
           | I'm actually faster than before, but it feels more comfy
           | than alt-tabbing to stack overflow, which is what I feel
           | like it's mostly replaced.
           | 
           | Poor stack overflow, it looks like they are the ones really
           | hurting from all this.
        
           | jona777than wrote:
           | > but then hit a brick wall
           | 
           | This is my intuition as well. I had a teammate use a pretty
           | good analogy today. He likened vibe coding to vacuuming up a
           | string in four tries when it only takes one try to reach down
           | and pick it up. I thought that aligned well with my
           | experience with LLM assisted coding. We have to vacuum the
           | floor while exercising the "difficult skill [of] continually
           | avoiding temptation to vibe"
        
         | Uehreka wrote:
         | > My personal theory is that getting a significant productivity
         | boost from LLM assistance and AI tools has a much steeper
         | learning curve than most people expect.
         | 
         | You hit the nail on the head here.
         | 
         | I feel like I've seen a lot of people trying to make strong
         | arguments that AI coding assistants aren't useful. As someone
         | who uses and enjoys AI coding assistants, I don't find this
         | research angle to be... uh... very grounded in reality?
         | 
         | Like, if you're using these things, the fact that they are
         | useful is pretty irrefutable. If one thinks there's some sort
         | of "productivity mirage" going on here, well OK, but to
         | demonstrate that, it might be better to start by acknowledging
         | areas where they _are_ useful, and show that your method
         | explains the reality we're seeing before using that method to
         | show areas where we might be fooling ourselves.
         | 
         | I can maybe buy that AI might not be useful for certain kinds
         | of tasks or contexts. But I keep pushing their boundaries and
         | they keep surprising me with how capable they are, so it feels
         | like it'll be difficult to prove otherwise in a durable
         | fashion.
        
           | TechDebtDevin wrote:
           | Still odd to me that the only vibe-coded software that gets
           | acquired is acquired by companies that sell tools or want to
           | promote vibe coding.
        
             | furyofantares wrote:
             | That's not odd. These things are incredibly useful and vibe
             | coding mostly sucks.
        
             | Uehreka wrote:
             | Pardon my caps, but WHO CARES about acquisitions?!
             | 
             | You've been given a dubiously capable genie that can write
             | code without you having to do it! If this thing can build
             | first drafts of those side projects you always think about
             | and never get around to, that in and of itself is useful!
             | If it can do the yak-shaving required to set up those e2e
             | tests you know you should have but never have time for,
             | it is useful!
             | 
             | Have it try out all the dumb ideas you have that might be
             | cool but don't feel worth your time to boilerplate out!
             | 
             | I like to think we're a bunch of creative people here! Stop
             | thinking about how it can make you money and use it for
             | fun!
        
               | fwip wrote:
               | Unfortunately, HN is YC-backed, and attracts these types
               | by design.
        
               | Uehreka wrote:
               | I mean sure, but HN/YC's founder was always going on
               | about the kinship between "Hackers and Painters" (or at
               | least he used to). It hasn't always been like this, and
               | definitely doesn't have to be. We can and should aspire
               | to better.
        
           | furyofantares wrote:
           | I think the thing is there IS a learning curve, AND there is
           | a productivity mirage, AND they are immensely useful, AND it
           | is context dependent. All of this leads to a lot of confusion
           | when communicating with people who are having a different
           | experience.
        
             | GoatInGrey wrote:
             | It always comes back to nuance!
        
             | Uehreka wrote:
             | Right, my problem is that while some people may be correct
             | about the productivity mirage, many of those people are
             | getting out over their skis and making bigger claims than
             | they can reasonably prove. I'm arguing that they should be
             | more nuanced and tactical.
        
           | rcruzeiro wrote:
           | Exactly. The people who say that these assistants are useless
           | or "not good enough" are basically burying their heads in the
           | sand. The people who claim that there is no mirage are
           | burying their head in the sand as well...
        
         | grey-area wrote:
         | Well, there are two possible interpretations here of 75% of
         | participants (all of whom had some experience using LLMs) being
         | slower using generative AI:
         | 
         | LLMs have a v. steep and long learning curve as you posit
         | (though note the points from the paper authors in the other
         | reply).
         | 
         | Current LLMs just are not as good as they are sold to be as a
         | programming assistant and people consistently predict and self-
         | report in the wrong direction on how useful they are.
        
           | Terr_ wrote:
           | > people consistently predict and self-report in the wrong
           | direction
           | 
           | I recall an adage about work-estimation: as chunks get too
           | big, people unconsciously substitute "how possible does the
           | final outcome feel" for "how long will the work take to do."
           | 
           | People asked "how long did it take" could be substituting
           | something else, such as "how alone did I feel while working
           | on it."
        
             | sandinmyjoints wrote:
             | That's an interesting adage. Any ideas of its source?
        
               | Dilettante_ wrote:
               | It might have been in Kahneman's "Thinking, Fast and
               | Slow"
        
               | Terr_ wrote:
               | I'm not sure, but something involving Kahneman _et al._
               | seems very plausible: The relevant term is probably
               | "Attribute Substitution."
               | 
               | https://en.wikipedia.org/wiki/Attribute_substitution
        
           | steveklabnik wrote:
           | > Current LLMs
           | 
           | One thing that happened here is that they aren't using
           | current LLMs:
           | 
           | > Most issues were completed in February and March 2025,
           | before models like Claude 4 Opus or Gemini 2.5 Pro were
           | released.
           | 
           | That doesn't mean this study is bad! In fact, I'd be very
           | curious to see it done again, but with newer models, to see
           | if that has an impact.
        
             | blibble wrote:
             | > One thing that happened here is that they aren't using
             | current LLMs
             | 
             | I've been hearing this for 2 years now
             | 
             | the previous model retroactively becomes total dogshit the
             | moment a new one is released
             | 
             | convenient, isn't it?
        
               | simonw wrote:
               | The previous model retroactively becomes not as good as
               | the best available models. I don't think that's a huge
               | surprise.
        
               | cwillu wrote:
               | The surprise is the implication that the crossover
               | between net-negative and net-positive impact happened to
               | be in the last 4 months, in light of the initial release
               | 2 years ago and sufficient public attention for a study
               | to be funded and completed.
               | 
               | Yes, it might make a difference, but it _is_ a little
               | tiresome that there's _always_ a "this is based on a
               | model that is x months old!" comment, because it will
               | always be true: an academic study does not get funded,
               | executed, written up, and published in less time.
        
               | Ntrails wrote:
               | Some of it is just that (probably different) people said
               | the same damn things 6 months ago.
               | 
               | "No, the 2.8 release is the first good one. It massively
               | improves workflows"
               | 
               | Then, 6 months later, the study comes out.
               | 
               | "Ah man, 2.8 was useless, 3.0 really crossed the
               | threshold on value add"
               | 
               | At some point, you roll your eyes and assume it is just
               | snake oil sales
        
               | Filligree wrote:
               | Or you accept that different people have different skill
               | levels, workflows and goals, and therefore the AIs reach
               | usability at different times.
        
               | steveklabnik wrote:
               | There's a lot of confounding factors here. For example,
               | you could point to any of these things in the last ~8
               | months as being significant changes:
               | 
               | * the release of agentic workflow tools
               | 
               | * the release of MCPs
               | 
               | * the release of new models, Claude 4 and Gemini 2.5 in
               | particular
               | 
               | * subagents
               | 
               | * asynchronous agents
               | 
               | All or any of these could have made for a big or small
               | impact. For example, I'm big on agentic tools, skeptical
               | of MCPs, and don't think we yet understand subagents.
               | That's different from those who, for example, think MCPs
               | are the future.
               | 
               | > At some point, you roll your eyes and assume it is just
               | snake oil sales
               | 
               | No, you have to realize you're talking to a population of
               | people, and not necessarily the same person. Opinions are
               | going to vary, they're not literally the same person each
               | time.
               | 
               | There are surely snake oil salesmen, but you can't buy
               | anything from me.
        
               | foobarqux wrote:
               | That's not the argument being made though, which is that
               | it does "work" _now_ and implying that actually it
               | didn't quite work before; except that that is the same
               | thing the same people say for every model release,
               | including at the time of release of the previous one,
               | which is now acknowledged to be seriously flawed; and
               | including the future one, at which time the current
               | models will similarly be acknowledged to be not only
               | less performant than the future models, but inherently
               | flawed.
               | 
               | Of course it's possible that at some point you get to a
               | model that really works, irrespective of the history of
               | false claims from the zealots, but it does mean you
               | should take their comments with a grain of salt.
        
               | steveklabnik wrote:
               | > That's not the argument being made though, which is
               | that it does "work" now and implying that actually it
               | didn't quite work before
               | 
               | Right.
               | 
               | > except that that is the same thing the same people say
               | for every model release,
               | 
               | I did not say that, no.
               | 
               | I am sure you can find someone who is in a Groundhog Day
               | about this, but it's just simpler than that: as tools
               | improve, more people find them useful than before. You're
               | not talking to the same people, you are talking to new
               | people each time who now have had their threshold
               | crossed.
        
               | blibble wrote:
               | > You're not talking to the same people, you are talking
               | to new people each time who now have had their threshold
               | crossed.
               | 
               | no, it's the same names, again and again
        
               | simonw wrote:
               | Got receipts?
               | 
               | That sounds like a claim you could back up with a little
               | bit of time spent using Hacker News search or similar.
               | 
               | (I might try to get a tool like o3 to run those searches
               | for me.)
        
               | blibble wrote:
               | try asking it what sealioning is
        
               | pdabbadabba wrote:
               | Maybe it's convenient. But isn't it also just a fact that
               | some of the models available today are better than the
               | ones available five months ago?
        
               | bryanrasmussen wrote:
               | sure, but after having spent some time trying to get
               | anything useful - programmatically - out of previous
               | models and not getting anything, once a new one is
               | announced how much time should one spend?
               | 
               | Sure you may end up missing out on a good thing and then
               | having to come late to the party, but coming early to the
               | party too many times and the beer is watered down and the
               | food has grubs is apt to make you cynical the next time a
               | party announcement comes your way.
        
               | Terr_ wrote:
               | Plus it's not even _possible_ to miss the metaphorical
               | party: If it gets going, it will be quite obvious long
               | before it peaks.
               | 
               | (Unless one believes the most grandiose prophecies of a
               | technological-singularity apocalypse, that is.)
        
               | Terr_ wrote:
               | That's not the issue. Their complaint is that proponents
               | keep revising what ought to be _fixed_ goalposts... Well,
               | fixed unless you believe unassisted human developers are
               | _also_ getting dramatically better at their jobs every
               | year.
               | 
               | Like the boy who cried wolf, it'll _eventually_ be true
               | with enough time... But we should stop giving them the
               | benefit of the doubt.
               | 
               | _____
               | 
               | Jan 2025: "Ignore last month's models, they aren't good
               | enough to show a marked increase in human productivity,
               | test with _this_ month's models and the benefits are
               | obvious."
               | 
               | Feb 2025: "Ignore last month's models, they aren't good
               | enough to show a marked increase in human productivity,
               | test with _this_ month's models and the benefits are
               | obvious."
               | 
               | Mar 2025: "Ignore last month's models, they aren't good
               | enough to show a marked increase in human productivity,
               | test with _this_ month's models and the benefits are
               | obvious."
               | 
               | Apr 2025: [Ad nauseam, you get the idea]
        
               | pdabbadabba wrote:
               | Fair enough. For what it's worth, I've always thought
               | that the more reasonable claim is that AI tools make
               | poor-average developers more productive, not necessarily
               | _expert_ developers.
        
               | steveklabnik wrote:
               | Sorry, that's not my take. I didn't think these tools
               | were useful _until_ the latest set of models, that is,
               | they crossed the threshold of usefulness to me.
               | 
               | Even then though, "technology gets better over time"
               | shouldn't be surprising, as it's pretty common.
        
               | mattmanser wrote:
               | Do you really see a massive jump?
               | 
               | For context, I've been using AI, a mix of OpenAI +
               | Claude, mainly for bashing out quick React stuff, for
               | over a year now. For anything else it's generally
               | rubbish and slower than working without it. Though I
               | still use it to rubber duck, so I'm still seeing the
               | level of quality for backend work.
               | 
               | I'd say they're only marginally better today than they
               | were even 2 years ago.
               | 
               | Every time a new model comes out you get a bunch of
               | people raving how great the new one is and I honestly
               | can't really tell the difference. The only real
               | difference is reasoning models actually slowed everything
               | down, but now I see its reasoning. It's only useful
               | because I often spot it leaving out important stuff from
               | the final answer.
        
               | hombre_fatal wrote:
               | I see a massive jump every time.
               | 
               | Just two years ago, this failed.
               | 
               | > Me: What language is this: "esto esta escrito en
               | ingles"
               | 
               | > LLM: English
               | 
               | Gemini and Opus have solved questions that took me weeks
               | to solve myself. And I'll feed some complex code into
               | each new iteration and it will catch a race condition I
               | missed even with testing and line by line scrutiny.
               | 
               | Consider how many more years of experience you need as a
               | software engineer to catch hard race conditions just from
               | reading code than someone who couldn't do it after trying
               | 100 times. We take it for granted already since we see it
               | as "it caught it or it didn't", but these are massive
               | jumps in capability.
        
               | steveklabnik wrote:
               | Yes. In January I would have told you AI tools are
               | bullshit. Today I'm on the $200/month Claude Max plan.
               | 
               | As with anything, your miles may vary: I'm not here to
               | tell anyone that thinks they still suck that their
               | experience is invalid, but to me it's been a pretty big
               | swing.
        
               | Uehreka wrote:
               | > In January I would have told you AI tools are bullshit.
               | Today I'm on the $200/month Claude Max plan.
               | 
               | Same. For me the turning point was VS Code's Copilot
               | Agent mode in April. That changed everything about how I
               | work, though it had a lot of drawbacks due to its
               | glitches (many of these were fixed within 6 or so weeks).
               | 
               | When Claude Sonnet 4 came out in May, I could immediately
               | tell it was a step-function increase in capability. It
               | was the first time an AI, faced with ambiguous and
               | complicated situations, would be willing to answer a
               | question with a definitive and confident "No".
               | 
               | After a few weeks, it became clear that VS Code's
               | interface and usage limits were becoming the bottleneck.
               | I went to my boss, bullet points in hand, and easily got
               | approval for the Claude Max $200 plan. Boom, another
               | step-function increase.
               | 
               | We're living in an incredibly exciting time to be a
               | skilled developer. I understand the need to stay
               | skeptical and measure the real benefits, but I feel like
               | a lot of people are getting caught up in the culture war
               | aspect and are missing out on something truly wonderful.
        
               | mattmanser wrote:
               | Ok, I'll have to try it out then. I've got a side project
               | I've 3/4 finished and will let it loose on it.
               | 
               | So are you using Claude Code via the max plan, Cursor, or
               | what?
               | 
               | I think I'd definitely hit AI news exhaustion and was
               | viewing people raving about this agentic stuff as yet
               | more AI fanbois. I'd just continued using the AI
               | separately, as setting up a new IDE seemed like too much
               | work for the
               | fractional gains I'd been seeing.
        
               | steveklabnik wrote:
               | I had a bad time with Cursor. I use Claude Code inside of
               | VS Code. You don't necessarily need Max, but you can
               | spend a lot of money very quickly on API tokens, so I'd
               | recommend to anyone trying, start with the $20/month one,
               | no need to spend a ton of money just to try something
               | out.
               | 
               | There is a skill gap, like, I think of it like vim: at
               | first it slows you down, but then as you learn it, you
               | end up speeding up. So you may also find that it doesn't
               | really vibe with the way you work, even if I am having a
               | good time with it. I know people who are great engineers
               | who still don't like this stuff, just like I know ones
               | that do too.
        
               | simonw wrote:
               | The massive jump in the last six months is that the new
               | set of "reasoning" models got really good at reasoning
               | about when to call tools, and were accompanied by a
               | flurry of tools-in-loop coding agents - Claude Code,
               | OpenAI Codex, Cursor in Agent mode etc.
               | 
               | An LLM that can test the code it is writing and then
               | iterate to fix the bugs turns out to be a huge step
               | forward from LLMs that just write code without trying to
               | then exercise it.
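               | 
               | The basic shape of that loop, as a minimal sketch (the
               | call_llm and apply_patch helpers here are hypothetical
               | placeholders, not any specific product's API):
               | 
               |   import subprocess
               | 
               |   def agent_loop(task, call_llm, apply_patch, max_iters=5):
               |       feedback = ""
               |       for _ in range(max_iters):
               |           # ask the model for a change and apply it
               |           apply_patch(call_llm(task + "\n" + feedback))
               |           # run the test suite and capture the output
               |           run = subprocess.run(
               |               ["pytest", "-q"], capture_output=True, text=True)
               |           if run.returncode == 0:
               |               return True  # tests pass, stop iterating
               |           # feed the failures back into the next prompt
               |           feedback = run.stdout + run.stderr
               |       return False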
        
               | vidarh wrote:
               | I've gone from asking the tools how to do things, and cut
               | and pasting the bits (often small) that'd be helpful, via
               | using assistants that I'd review every decision of and
               | often having to start over, to now often starting an
               | assistant with broad permissions and just reviewing the
               | diff later, after they've made the changes pass the test
               | suite, run a linter and fixed all the issues it brought
               | up, and written a draft commit message.
               | 
               | The jump has been massive.
        
               | ipaddr wrote:
               | Wait until the next set. You will find the previous
               | ones weren't useful after all.
        
               | steveklabnik wrote:
               | This makes no sense to me. I'm well aware that I'm
               | getting value today, that's not going to change in the
               | future: it's already happened.
               | 
               | Sure they may get _even more_ useful in the future but
               | that doesn't change my present.
        
               | jstummbillig wrote:
               | Convenient for whom and what...? There is nothing
               | tangible to gain from you believing or not believing that
               | someone else does (or does not) get a productivity boost
               | from AI. This is not a religion and it's not crypto. The
               | AI users' net worth is not tied to another one's use of or
               | stance on AI (if anything, it's the opposite).
               | 
               | More generally, the phenomenon here is quite simply
               | explained and nothing surprising: New things improve,
               | quickly. That does not mean that something is good or
               | valuable but it's how new tech gets introduced every
               | single time, and readily explains changing sentiment.
        
               | card_zero wrote:
               | I saw that edit. Indeed you can't predict that rejecting
               | a new thing is part of a routine of being wrong. It's
               | true that "it's strange and new, therefore I hate it" is
               | a very human (and adorable) instinct, but sometimes it's
               | reasonable.
        
               | jstummbillig wrote:
               | "I saw that edit" lol
        
               | card_zero wrote:
               | Sorry, just happened to. Slightly rude of me.
        
               | jstummbillig wrote:
               | Ah, you do you. It's just a fairly kindergarten thing to
               | point out and not something I was actively trying to
               | hide. Whatever it was.
               | 
               | Generally, I do a couple of edits for clarity after
               | posting and reading again. Sometimes that involves
               | removing something that I feel could have been said
               | better. If it does not work, I will just delete the
               | comment. Whatever it was must not have been a super huge
               | deal (to me).
        
               | grey-area wrote:
               | Honestly the hype cycle feels very like crypto, and just
               | like crypto prominent vcs have a lot of money riding on
               | the outcome.
        
               | steveklabnik wrote:
               | I agree with you, and I think that's coloring a lot of
               | people's perceptions. I am not a crypto fan but am an LLM
               | fan.
               | 
               | Every hype cycle feels like this, and some of them are
               | nonsense and some of them are real. We'll see.
        
               | jstummbillig wrote:
               | Of course, lots of hype, but my point is that the reason
               | why is very different and it matters: as an early bc
               | adopter, making you believe in bc is super important to
               | my net worth (and you not believing in bc makes me look
               | like an idiot and lose a lot of money).
               | 
               | In contrast, what do I care if you believe in code
               | generation AI? If you do, you are probably driving up
               | pricing. I mean, I am sure that there are people that
               | care very much, but there is little inherent value for me
               | in you doing so, as long as the people who are building
               | the AI are making enough profit to keep it running.
               | 
               | With regards to the VCs, well, how many VCs are there in
               | the world? How many of the people who have something good
               | to say about AI are likely VCs? I might be off by an
               | order of magnitude, but even then it would really not be
               | driving the discussion.
        
               | leshow wrote:
               | I don't find that a compelling argument, lots of people
               | get taken in by hype cycles even when they don't profit
               | directly from it.
        
               | leshow wrote:
               | I think you're missing the broader context. There are a
               | lot of people very invested in the maximalist outcome
               | which does create pressure for people to be boosters. You
               | don't need a digital token for that to happen. There's a
               | social media aspect as well that creates a feedback loop
               | about claims.
               | 
               | We're in a hype cycle, and it means we should be extra
               | critical when evaluating the tech so we don't get taken
               | in by exaggerated claims.
        
               | jstummbillig wrote:
               | I mostly don't agree. Yes, there is always social
               | pressure with these things, and we are in a hype cycle,
               | but the people "buying in" are simply not doing much at
               | all. They are mostly consumers, waiting for the next
               | model, which they have no control over or stake in
               | creating (by and large).
               | 
               | The people _not_ buying into the hype, on the other
               | hand, are actually the ones that have a very good reason
               | to be invested, because if they turn out to be wrong they
               | might face some very uncomfortable adjustments in the job
               | landscape and a lot of the skills that they worked so
               | hard to gain and believed to be valuable.
               | 
               | As always, be wary of any claims, but the tension here
               | is very much the reverse of crypto and I don't think
               | that's very appreciated.
        
               | cfst wrote:
               | The current batch of models, specifically Claude Sonnet
               | and Opus 4, are the first I've used that have actually
               | been more helpful than annoying on the large mixed-
               | language codebases I work in. I suspect that dividing
               | line differs greatly between developers and applications.
        
               | nalllar wrote:
               | If you interact with internet comments and discussions as
               | an amorphous blob of people you'll see a constant trickle
               | of the view that models now are useful, and before were
               | useless.
               | 
               | If you pay attention to who says it, you'll find that
               | people have different personal thresholds for finding
               | llms useful, not that any given person like steveklabnik
               | above keeps flip-flopping on their view.
               | 
               | This is a variant on the goomba fallacy:
               | https://englishinprogress.net/gen-z-slang/goomba-fallacy-
               | exp...
        
               | bix6 wrote:
               | Everything actually got better. Look at the image
               | generation improvements as an easily visible benchmark.
               | 
               | I do not program for my day job and I vibe coded two
               | different web projects. One in twenty mins as a test with
               | cloudflare deployment having never used cloudflare and
               | one in a week over vacation (and then fixed a deep safari
               | bug two weeks later by hammering the LLM). These tools
               | massively raise the capabilities for sub-average people
               | like me and decrease the time / brain requirements
               | significantly.
               | 
               | I had to make a little update to reset the KV store on
               | cloudflare and the LLM did it in 20s after failing the
               | syntax twice. I would've spent at least a few minutes
               | looking it up otherwise.
        
               | Aeolun wrote:
               | It's true though? Previous models could do well in
               | specifically created settings. You can throw practically
               | everything at Opus, and it'll work mostly fine.
        
           | burnte wrote:
           | > Current LLMs just are not as good as they are sold to be as
           | a programming assistant and people consistently predict and
           | self-report in the wrong direction on how useful they are.
           | 
           | I would argue you don't need the "as a programming assistant"
           | phrase: right now, from my experience over the past 2 years,
           | literally every single AI tool is massively oversold as to
           | its utility. I've literally not seen a single one that
           | delivers on what it's billed as capable of.
           | 
           | They're useful, but right now they need a lot of handholding
           | and I don't have time for that. Too much fact checking. If I
           | want a tool I always have to double check, I was born with a
           | memory so I'm already good there. I don't want to have to
           | fact check my fact checker.
           | 
           | LLMs are great at small tasks. The larger the single task is,
           | or the more tasks you try to cram into one session, the worse
           | they fall apart.
        
           | atiedebee wrote:
           | Let me bring you a third (not necessarily true)
           | interpretation:
           | 
           | The developer who has experience using cursor saw a
           | productivity increase not because he became better at using
           | cursor, but because he became worse at _not_ using it.
        
             | card_zero wrote:
             | Or, one person in 16 has a particular personality, inclined
             | to LLM dependence.
        
               | runarberg wrote:
               | Invoking personality is to the behavioral sciences as
               | invoking God is to the natural sciences. One can explain
               | anything by appealing to personality, and as such it
               | explains nothing. Psychologists have been trying to make
               | sense of personality for over a century without much
               | success (the best efforts so far have been a five factor
               | model [Big 5] which has ultimately pretty minor
               | predictive value), which is why most behavioral
               | scientists have learned to simply leave personality to
               | the philosophers and concentrate on a much simpler
               | theoretical framework.
               | 
               | A much simpler explanation is what your parent offered.
               | And to many behavioralists it is actually the same
               | explanation, as to a true scotsm... [ _cough_ ]
               | behavioralist personality is simply learned habits, so--
               | by Occam's razor--you should omit personality from your
               | model.
        
               | card_zero wrote:
               | Fair comment, but I'm not down with behavioralism, and
               | people have personalities, regrettably.
        
               | runarberg wrote:
               | This is still ultimately research within the field of
               | the behavioral sciences, and as such the laws of human
               | behavior apply, where behaviorism offers a far more
               | successful theoretical framework than personality
               | psychology.
               | 
               | Nobody is denying that people have personalities btw. Not
               | even true behavioralists do that, they simply argue from
               | reductionism that personality can be explained with
               | learning contingencies and the reinforcement history.
               | Very few people are true behavioralists these days
               | though, but within the behavior sciences, scientists are
               | much more likely to borrow missing factors (i.e. things
               | that learning contingencies fail to explain) from fields
               | such as cognitive science (or even further to
               | neuroscience) and (less often) social science.
               | 
               | What I am arguing here, however, is that the appeal to
               | personality is unnecessary when explaining behavior.
               | 
               | As for figuring out what personality is, that is still
               | within the realm of philosophy. Maybe cognitive science
               | will do a better job at explaining it than
               | psychometricians have done for the past century. I
               | certainly hope so, it would be nice to have a better
               | model of human behavior. But I think even if we could
               | explain personality, it still wouldn't help us here. At
               | best we would be in a similar situation as physics, where
               | one model can explain things traveling at the speed of
               | light, while another model can explain things at the sub-
               | atomic scale, but the two models cannot be applied
               | together.
        
               | cutemonster wrote:
               | Didn't they rather mean:
               | 
               | Developers' own skills might atrophy, when they don't
               | write that much code themselves, relying on AI instead.
               | 
               | And now when comparing with/without AI they're faster
               | with. But a year ago they might have been that fast or
               | faster _without_ an AI.
               | 
               | I'm not saying that that's how things are. Just pointing
               | out another way to interpret what GP said
        
           | robwwilliams wrote:
           | Or a sampling artifact. 4 vs 12 does seem significant within
           | a study, but consider a set of N such studies.
           | 
           | I assume that many large companies have tested efficiency
           | gains and losses of their programmers much more extensively
           | than the authors of this tiny study.
           | 
           | A survey of companies and their evaluations and conclusions
           | would carry more weight---excluding companies selling AI
           | products, of course.
        
             | rs186 wrote:
             | If you use a binomial test, P(X<=4) is about 0.105, which
             | means p = 0.21.
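             | 
             | (A minimal sketch of that calculation with SciPy -- the
             | 4-of-16 split is just the "4 vs 12" from upthread, so swap
             | in the study's actual per-developer counts to reproduce
             | the numbers above:)
             | 
             |   from scipy.stats import binomtest
             | 
             |   # assumed split: 4 of 16 developers faster with AI,
             |   # under a null of a 50/50 chance per developer
             |   print(binomtest(4, 16, p=0.5, alternative="less").pvalue)
             |   print(binomtest(4, 16, p=0.5).pvalue)  # two-sided p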
        
         | bgwalter wrote:
         | We have heard variations of that narrative for at least a year
         | now. It is not hard to use these chatbots and no one who was
         | very productive in open source before "AI" has any higher
         | output now.
         | 
         | Most people who subscribe to that narrative have some
         | connection to "AI" money, but there might be some misguided
         | believers as well.
        
         | bc1000003 wrote:
         | "My intiution is that..." - AGREED.
         | 
         | I've found that there are a couple of things you need to do to
         | be very efficient.
         | 
         | - Maintain an architecture.md file (with AI assistance) that
         | answers many of the questions and clarifies a lot of the
         | ambiguity in the design and structure of the code.
         | 
         | - A bootstrap.md file (or files) is also useful for a lot of
         | tasks.. having the AI read it and start with a correct idea
         | about the subject is a time saver for a variety of kinds of
         | tasks.
         | 
         | - Regularly asking the AI to refactor code, simplify it,
         | modularize it - this is what the experienced dev is for. VIBE
         | coding generally doesn't work as AIs tend to write messy non-
         | modular code unless you tell them otherwise. But if you review
         | code, ask for specific changes.. they happily comply.
         | 
         | - Read the code produced, and carefully review it. And notice
         | and address areas where there are issues, have the AI fix all
         | of these.
         | 
         | - Take over when there are editing tasks you can do more
         | efficiently.
         | 
         | - Structure the solution/architecture in ways that you know the
         | AI will work well with.. things it knows about.. it's general
         | sweet spots.
         | 
         | - Know when to stop using the AI and code it yourself..
         | particularly when the AI has entered the confusion doom loop.
         | Time spent trying to get the AI to figure out what it never
         | will is better spent just fixing it yourself.
         | 
         | - Know when to just not ever try to use AI. Intuitively you
         | know there's just certain code you can't trust the AI to safely
         | work on. Don't be a fool and break your software.
         | 
         | ----
         | 
         | I've found there's no guarantee that AI assistance will speed
         | up any one project (and in some cases slow it down).. but
         | measured across all tasks and projects, the benefits are pretty
         | substantial. That's probably others experience at this point
         | too.
        
         | ericmcer wrote:
         | Looking at the example tasks in the pdf ("Sentencize wrongly
         | splits sentence with multiple...") these look like really
         | discrete and well defined bug fixes. AI should smash tasks like
         | that so this is even less hopeful.
        
         | rafaelmn wrote:
         | >My personal theory is that getting a significant productivity
         | boost from LLM assistance and AI tools has a much steeper
         | learning curve than most people expect.
         | 
         | Are we still selling the "you are an expert senior
         | developer" meme? I can completely see how once you are working
         | on a mature codebase LLMs would only slow you down. Especially
         | one that was not created by an LLM and where you are the
         | expert.
        
           | bicx wrote:
           | I think it depends on the kind of work you're doing, but I
           | use it on mature codebases where I am the expert, and I
           | heavily delegate to Claude Code. By being knowledgeable of
           | the codebase, I know exactly how to specify a task I need
           | performed. I set it to work on one task, then I monitor it
           | while personally starting on other work.
           | 
           | I think LLMs shine when you need to write a higher volume of
           | code that extends a proven pattern, quickly explore
           | experiments that require a lot of boilerplate, or have
           | multiple smaller tasks that you can set multiple agents upon
           | to parallelize. I've also had success in using LLMs to do a
           | lot of external documentation research in order to integrate
           | findings into code.
           | 
           | If you are fine-tuning an algorithm or doing domain-expert-
           | level tweaks that require a lot of contextual input-output
           | expert analysis, then you're probably better off just coding
           | on your own.
           | 
           | Context engineering has been mentioned a lot lately, but it's
           | not a meme. It's the real trick to successful LLM agent
           | usage. Good context documentation, guides, and well-defined
           | processes (just like with a human intern) will mean the
           | difference between success and failure.
        
         | dmezzetti wrote:
         | I'm the developer of txtai, a fairly popular open-source
         | project. I don't use any AI-generated code and it's not
         | integrated into my workflows at the moment.
         | 
         | AI has a lot of potential but it's way over-hyped right now.
         | Listen to the people on the ground who are doing real work and
         | building real projects: none of them are over-hyping it. The
         | over-hyping mostly comes from those who have only tangentially
         | used LLMs.
         | 
         | It's also not surprising that many in this thread are clinging
         | to a basic premise that it's 3 steps backwards to go 5 steps
         | forward. Perhaps that is true but I'll take the study at face
         | value, it seems very plausible to me.
        
         | mnky9800n wrote:
         | I feel like I get better at it as I use Claude Code more
         | because I both understand its strengths and weaknesses and also
         | understand what context it's usually missing. Like today I was
         | struggling to debug an issue and realised that Claude's idea of
         | a coordinate system was 90 degrees rotated from mine and thus
         | it was getting confused because I was confusing it.
        
           | throwawayoldie wrote:
           | One of the major findings is that people's perception--that
           | is, what it felt like--was incorrect.
        
         | devin wrote:
         | It seems really surprising to me that anyone would call 50
         | hours of experience a "high skill ceiling".
        
         | keeda wrote:
         | _> My personal theory is that getting a significant
         | productivity boost from LLM assistance and AI tools has a much
         | steeper learning curve than most people expect._
         | 
         | Yes, and I'll add that there is likely no single "golden
         | workflow" that works for everybody, and everybody needs to
         | figure it out for themselves. It took me _months_ to figure out
         | how to be effective with these tools, and I doubt my approach
         | will transfer over to others' situations.
         | 
         | For instance, I'm working solo on smallish, research-y projects
         | and I had the freedom to structure my code and workflows in a
         | way that works best for me and the AI. Briefly: I follow an ad-
         | hoc, pair-programming paradigm, fluidly switching between
         | manual coding and AI-codegen depending on an instinctive
         | evaluation of whether a prompt would be faster. This rapid
         | manual-vs-prompt assessment is second nature to me now, but it
         | took me a while to build that muscle.
         | 
         | I've not worked with coding agents, but I doubt this approach
         | will transfer over well to them.
         | 
         | I've said it before, but this is technology that behaves like
         | people, and so you have to approach it like working with a
         | colleague, with all their quirks and fallibilities and
         | potentially-unbound capabilities, rather than a deterministic,
         | single-purpose tool.
         | 
         | I'd love to see a follow-up of the study where they let the
         | same developers get more familiar with AI-assisted coding for a
         | few months and repeat the experiment.
        
           | Filligree wrote:
           | > I've not worked with coding agents, but I doubt this
           | approach will transfer over well to them.
           | 
           | Actually, it works well so long as you tell them when you've
           | made a change. Claude gets confused if things randomly change
           | underneath it, but it has no trouble so long as you give it a
           | short explanation.
        
         | ummonk wrote:
         | Devil's advocate: it's also possible the one developer hasn't
         | become more productive with Cursor, but rather has atrophied
         | their non-AI productivity due to becoming reliant on Cursor.
        
         | thesz wrote:
         | > My personal theory is that getting a significant productivity
         | boost from LLM assistance and AI tools has a much steeper
         | learning curve than most people expect.
         | 
         | This is what I heard about strong type systems (especially
         | Haskell's) about 15-20 years ago.
         | 
         | "History does not repeat, but it rhymes."
         | 
         | If we rhyme "strong types will change the world" with "agentic
         | LLMs will change the world," what do we get?
         | 
         | My personal theory is that we will get the same: some people
         | will get modest-to-substantial benefits there, but changes in
         | the world will be small if noticeable at all.
        
           | ruszki wrote:
           | Maybe it depends on the task. I'm 100% sure that if you
           | think a type system is a drawback, then you have never
           | coded in a diverse, large codebase. Our 1.5-million-LOC,
           | 30-year-old monolith would be completely unmaintainable
           | without it. But seriously, anything above 10 LOC without a
           | formal type system becomes unmaintainable after a few years.
           | An informal one is fine for a while, but not for long. In
           | 30-year-old code, basically every single informal rule has
           | been broken.
           | 
           | Also, my long experience is that even in the PoC phase,
           | using a type system adds almost zero extra time... of course,
           | if you know the type system, which should be trivial in any
           | case after you've seen a few.
        
           | leshow wrote:
           | I don't think that's a fair comparison. Type systems don't
           | produce probabilistic output. Their entire purpose is to
           | reduce the scope of possible errors you can write. They kind
           | of did change the world, didn't they? I mean, not everyone is
           | writing Haskell but Rust exists and it's doing pretty well.
           | There was also not really a case to be made where type
           | systems made software in general _worse_. But you could
           | definitely make the case that LLMs might make software
           | worse.
        
             | atlintots wrote:
             | It's too bad the management people never pushed Haskell as
             | hard as they're pushing AI today! Alas.
        
         | Aurornis wrote:
         | > A quarter of the participants saw increased performance, 3/4
         | saw reduced performance.
         | 
         | The study used 246 tasks across 16 developers, for an average
         | of 15 tasks per developer. Divide that further in half because
         | tasks were assigned as AI or not-AI assisted, and the sample
         | size per developer is still relatively small. Someone would
         | have to take the time to review the statistics, but I don't
         | think this is a case where you can start inferring that the
         | developers who benefited from AI were just better at using AI
         | tools than those who were not.
         | 
         | I do agree that it would be interesting to repeat a similar
         | test on developers who have more AI tool experience, but then
         | there is a potential confounding effect that AI-enthusiastic
         | developers could actually lose some of their practice in
         | writing code without the tools.
        
         | th0ma5 wrote:
         | Simon's opinion is unsurprisingly that people need to read his
         | blog and spam on every story on HN lest we be left behind.
        
         | eightysixfour wrote:
         | I have been teaching people at my company how to use AI code
         | tools. The learning curve is way worse for developers, and I
         | have had to come up with some exercises to try to break through
         | the curve. Some seemingly can't get it.
         | 
         | The short version is that devs want to give instructions
         | instead of ask for what outcome they want. When it doesn't
         | follow the instructions, they double down by being more
         | precise, the worst thing you can do. When non devs don't get
         | what they want, they add more detail to the description of the
         | desired outcome.
         | 
         | Once you get past the control problem, then you have a second
         | set of issues for devs where the things that should be easy or
         | hard don't necessarily map to their mental model of what is
         | easy or hard, so they get frustrated with the LLM when it can't
         | do something "easy."
         | 
         | Lastly, devs keep a shitload of context in their head - the
         | project, what they are working on, application state, etc. -
         | and they need to do that for LLMs too, but they have to repeat
         | themselves often and "be" the external memory for the LLM. Most
         | devs I have taught hate that; they would rather have it the
         | other way around, where they get help with context and state
         | but instruct the computer on their own.
         | 
         | Interestingly, the best AI assisted devs have often moved to
         | management/solution architecture, and they find the AI code
         | tools brought back some of the love of coding. I have a
         | hypothesis they're wired a bit differently and their role with
         | AI tools is actually closer to management than it is
         | development in a number of ways.
        
         | heavyset_go wrote:
         | Any "tricks" you learn for one model may not be applicable to
         | another, it isn't a given that previous experience with a
         | company's product will increase the likelihood of productivity
         | increases. When models change out from under you, the
         | heuristics you've built up might be useless.
        
       | inetknght wrote:
       | > _We pay developers $150 /hr as compensation for their
       | participation in the study._
       | 
       | Can someone point me to these 300k/yr jobs?
        
         | recursive wrote:
         | These are not 300k/yr jobs.
        
         | akavi wrote:
         | L5 ("Senior") at any FAANG co, L6 ("Staff") at pretty much any
         | VC-backed startup in the bay.
        
       | nestorD wrote:
       | One thing I could not find on a cursory read is how used those
       | developers were to AI tools. I would expect someone using them
       | regularly to benefit, while someone who has only played with them
       | a couple of times would likely be slowed down as they deal with
       | the friction of learning to be productive with the tool.
        
         | uludag wrote:
         | In this case though you still wouldn't necessarily know if the
         | AI tools had a positive causal effect. For example, I
         | practically live in Emacs. Take that away and no doubt I would
         | be immensely less effective. That Emacs improves my
         | productivity and without it I am much worse in no way implies
         | that Emacs is better than the alternatives.
         | 
         | I feel like a proper study for this would involve following
         | multiple developers over time, tracking how their contribution
         | patterns and social standing changes. For example, take three
         | cohorts of relatively new developers: instruct one to go all in
         | on agentic development, one to freely use AI tools, and one
         | prohibited from AI tools. Then teach these developers open
         | source (like a course off of this book:
         | https://pragprog.com/titles/a-vbopens/forge-your-future-
         | with...) and have them work for a year to become part of a
         | project of their choosing. Then in the end, track a number of
         | metrics such as leadership position in community, coding/non-
         | coding contributions, emotional connection to project, social
         | connections made with community, knowledge of code base, etc.
         | 
         | Personally, my prior probability is that the no-ai group would
         | likely still be ahead overall.
        
       | swayvil wrote:
       | AI by design can only repeat and recombine past material.
       | Therefore actual invention is out.
        
         | elpakal wrote:
         | underrated comment
        
         | atleastoptimal wrote:
         | HN moment
        
         | luibelgo wrote:
         | Is that actually proven?
        
           | greenchair wrote:
           | The easiest way to see this for yourself is with an image
           | generator. Try asking for a very specific combination of
           | things that would not normally appear together in an
           | artpiece.
        
         | keeda wrote:
         | Pretty much all invention is a novel combination of known
         | techniques. Anything that introduces a fundamentally new
         | technique is usually in the realm of groundbreaking papers and
         | Nobel prizes.
        
       | zzzeek wrote:
       | As a project for work, I've been using Claude CLI all week to do
       | as many tasks as possible. So with my week's experience, I'm now
       | an expert in this subject and can weigh in.
       | 
       | Two things stand out to me: 1. it depends a lot on what kind of
       | task you are having the LLM do, and 2. even if the LLM process
       | takes more time, _it is very likely your cognitive effort was
       | still way less_. For sysadmin kinds of tasks, working with less
       | often accessed systems, LLMs can read --help, man pages, and doc
       | sites, all for you, and give you the working command right there
       | (and then run it, and then look at the output and tell you why it
       | failed, or how it worked, and what it did). There is absolutely
       | no question that second part is a big deal. Sticking it onto my
       | large open source project to fix a deep, esoteric issue or write
       | some subtle documentation where it doesn't really "get" what I'm
       | doing, yeah, it is not as productive in that realm and you might
       | want to skip it for the thinking part there. I think everyone is
       | trying to figure out this question of "when and how" for LLMs. I
       | think the sweet spot is tasks involving systems and technologies
       | where you'd otherwise be spending a lot of time googling,
       | stackoverflowing, and reading man pages to get just the right
       | parameters into commands and so forth. This is cognitive grunt
       | work and the LLMs can do that part very well.
       | 
       | My week of effort with it was not really "coding on my open
       | source project"; two examples were: 1. running a bunch of ansible
       | playbooks that I wrote years ago on a new host, where OS upgrades
       | had lots of snags; I worked with Claude to debug all the various
       | error messages and places where the newer OS distribution had
       | different packages, missing packages, etc. It was ENORMOUSLY
       | helpful since I never look at these playbooks and don't even
       | remember what I did; Claude can read them and interpret them
       | as well as you can. 2. I got a bugzilla for a fedora package that
       | I packaged years ago, where they have some change to the
       | directives used in specfiles that everyone has to make. I look at
       | fedora packaging workflows once every three years. I told Claude
       | to read the BZ and just do it. IT DID IT. I had to get involved
       | running the "mock" suite as it needed sudo, but Claude gave me the
       | commands. _zero googling_. _zero even reading the new format of
       | the specfile_ (the BZ linked to a tool that does the conversion).
       | From bug received to bug closed, I didn't do any typing at all
       | outside of the prompt. Had it done before breakfast since I didn't
       | even need any glucose for mental energy expended. This would have
       | been a painful and frustrating mental effort otherwise.
       | 
       | so the studies have to get more nuanced and survey a lot more
       | than 16 devs I think
        
       | geerlingguy wrote:
       | So far in my own hobby OSS projects, AI has only hampered things,
       | as code generation/scaffolding is probably the least of my
       | concerns, whereas code review, community wrangling, etc. are more
       | impactful. And AI tooling can only do so much.
       | 
       | But it has hampered me in that others, uninvited, toss an
       | AI code review tool at some of my open PRs, and that spits out a
       | 2-page document with cute emoji and formatted bullet points going
       | over all aspects of a 30 line PR.
       | 
       | Just adds to the noise, so now I spend time deleting or hiding
       | those comments in PRs, which means I have even _less_ time for
       | actual useful maintenance work. (Not that I have much already.)
        
       | heisenbit wrote:
       | AI sometimes points out hygiene issues that may be swept under
       | the carpet but, once pointed out, can't be ignored anymore. I
       | know I don't need that error handling, I'm certain, at least for
       | the near future, but maybe it will be needed... Also, the code
       | produced by the AI has some impedance mismatch with my natural
       | code. Then one needs to figure out whether that is due to best
       | practices having moved on, best practices ignored until now, or
       | the AI being overwhelmed with code from beginners. This all takes
       | time - some of it is transient, some of it actually improves
       | things, and some of it is waste. The jury is still out.
        
       | ChrisMarshallNY wrote:
       | It's been _very_ helpful for me. I find ChatGPT the easiest to
       | use; not because it's more accurate (it isn't), but because it
       | seems to understand the intent of my questions most clearly. I
       | don't usually have to iterate much.
       | 
       | I use it like a know-it-all personal assistant that I can ask any
       | question to; even [especially] the embarrassing, "stupid" ones.
       | 
       |  _> The only stupid question is the one we don't ask.
       | 
       | - On an old art teacher's wall_
        
       | 0xmusubi wrote:
       | I find myself having discussions with AI about different design
       | possibilities and it sometimes comes up with ideas I hadn't
       | thought of or features I wasn't aware of. I wouldn't classify
       | this as "overuse" as I often find the discussions useful, even if
       | it's just to get my thoughts down. This might be more relevant
       | for larger scoped tasks or ones where the programmer isn't as
       | familiar with certain features or libraries though.
        
       | groos wrote:
       | One thing I've experienced in trying to use LLMs to code in an
       | existing large code base is that it's _extremely_ hard to
       | accurately describe what you want to do. Oftentimes, you are
       | working on a problem with a web of interactions all over the code
       | and describing the problem to an LLM will take far longer than
       | just doing it manually. This is not the case with generating new
       | (boilerplate) code for projects, which is where users report the
       | most favorable interaction with LLMs.
        
         | 9dev wrote:
         | That's my experience as well. It's where Knuth comes in again:
         | the program doesn't just live in the code, but also in the
         | mind of its creator. Unless I communicate all that context
         | from the start, I can't just dump years of concepts and
         | strategy out of my brain into the LLM without missing details
         | that would be relevant.
        
       | AvAn12 wrote:
       | N = 16 developers. Is this enough to draw any meaningful
       | conclusions?
        
         | sarchertech wrote:
         | That depends on the size of the effect you're trying to
         | measure. If cursor provides a 5x, 10x, or 100x productivity
         | boost as many people are claiming, you'd expect to see that in
         | a sample size of 16 unless there's something seriously wrong
         | with your sample selection.
         | 
         | If you are looking for a 0.1% increase in productivity, then 16
         | is too small.
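         | 
         | For intuition, a back-of-the-envelope sketch using the standard
         | normal-approximation sample-size formula (80% power, two-sided
         | alpha = 0.05); the effect sizes below are illustrative, not
         | taken from the paper:
         | 
         |     // n per group ~= 2 * (z_alpha/2 + z_beta)^2 / d^2,
         |     // where d is the standardized effect size (Cohen's d).
         |     function sampleSizePerGroup(
         |       d: number,
         |       zAlpha = 1.96, // two-sided alpha = 0.05
         |       zBeta = 0.84,  // 80% power
         |     ): number {
         |       return Math.ceil((2 * (zAlpha + zBeta) ** 2) / d ** 2);
         |     }
         | 
         |     sampleSizePerGroup(2.0);  // huge effect: ~4 per group
         |     sampleSizePerGroup(0.5);  // medium effect: ~63 per group
         |     sampleSizePerGroup(0.05); // tiny effect: ~6,272 per group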
        
           | biophysboy wrote:
           | Well it depends on the variance of the random variable
           | itself. You're right that with big, obvious effects, a larger
           | n is less "necessary". I could see individuals having very
           | different "productivities", especially when the idea is
           | flattened down to completion time.
        
           | AvAn12 wrote:
           | "A quarter of the participants saw increased performance, 3/4
           | saw reduced performance." So I think any conclusions drawn on
           | these 16 people don't signify much one way or the other.
           | Cool paper but how is this anything other than a null
           | finding?
        
       | atleastoptimal wrote:
       | I'm not surprised that AI doesn't help people with 5+ years
       | experience in open source contribution, but I'd imagine most
       | people aren't claiming AI tools are at senior engineer level yet.
       | 
       | Once the tools and how people use them improve, AI won't be a
       | hindrance for advanced tasks like this, and soon after, AI will
       | be able to do these PRs on its own. It's inevitable given the
       | rate of improvement even since this study.
        
         | artee_49 wrote:
         | Even for senior levels the claim has been that AI will speed up
         | their coding (take it over) so they can focus on higher level
         | decisions and abstract concepts. These contributions are not
         | those higher-level decisions; they are exactly the coding work
         | that, based on prior predictions, should have gone faster.
        
       | pera wrote:
       | Wow, these are extremely interesting results, especially this part:
       | 
       | > _This gap between perception and reality is striking:
       | developers expected AI to speed them up by 24%, and even after
       | experiencing the slowdown, they still believed AI had sped them
       | up by 20%._
       | 
       | I wonder what could explain such a large difference between
       | estimation/experience and reality; any ideas?
       | 
       | Maybe our brains are measuring mental effort and distorting our
       | experience of time?
        
         | alfalfasprout wrote:
         | I would speculate that it's because there's been a huge
         | concerted effort to make people want to believe that these
         | tools are better than they are.
         | 
         | The "economic experts" and "ml experts" are in many cases
         | effectively the same group-- companies pushing AI coding tools
         | have a vested interest in people believing they're more useful
         | than they are. Executives take this at face value and broadly
         | promise major wins. Economic experts take this at face value
         | and use this for their forecasts.
         | 
         | This propagates further, and now novices and casual individuals
         | begin to believe in the hype. Eventually, as an experienced
         | engineer it moves the "baseline" expectation much higher.
         | 
         | Unfortunately this is very difficult to capture empirically.
        
         | longwave wrote:
         | I also wonder how many of the numerous AI proponents in HN
         | comments are subject to the same effect. Unless they are truly
         | measuring their own performance, is AI really making them more
         | productive?
        
         | chamomeal wrote:
         | It's funny cause I sometimes have the opposite experience. I
         | tried to use Claude code today to make a demo app to show off a
         | small library I'm working on. I needed it to set up some very
         | boilerplatey example app stuff.
         | 
         | It was fun to watch, it's super polished and sci-fi-esque. But
         | after 15 minutes I felt braindead and was bored out of my mind
         | lol
        
         | evanelias wrote:
         | Here's a scary thought, which I'm admittedly basing on
         | absolutely nothing scientific:
         | 
         | What if agentic coding sessions are triggering a similar
         | dopamine feedback loop as social media apps? Obviously not to
         | the same degree as social media apps, I mean coding for work is
         | still "work"... but there's maybe some similarity in getting
         | iterative solutions from the agent, triggering something in
         | your brain each time, yes?
         | 
         | If that was the case, wouldn't we expect developers to have an
         | overly positive perception of AI because they're literally
         | becoming addicted to it?
        
           | EarthLaunch wrote:
           | > The LLMentalist Effect: how chat-based Large Language
           | Models replicate the mechanisms of a psychic's con
           | 
           | https://softwarecrisis.dev/letters/llmentalist/
           | 
           | Plus there's a gambling mechanic: Push the button, sometimes
           | get things for free.
        
           | csherratt wrote:
           | That's my suspicion too.
           | 
           | My issue with this being a 'negative' thing is that I'm not
           | sure it is. It works off of the same hunting / foraging
           | instincts that keep us alive. If you feel addiction to
           | something positive, is it bad?
           | 
           | Social media is negative because it addicts you to mostly low
           | quality filler content. Content that doesn't challenge you.
           | You are reading shitposts instead of reading a book or doing
           | something that's better for you in the long run.
           | 
           | One could argue that's true for AI, but I'm not confident
           | enough to make such a statement.
        
             | evanelias wrote:
             | The study found AI caused a "significant slowdown" in
             | developer efficiency though, so that doesn't seem positive!
        
       | afro88 wrote:
       | Early 2025. I imagine the results would be quite different with
       | mid 2025 models and tools.
        
       | gmaster1440 wrote:
       | What if the slowdown isn't a bug but a feature? What if AI tools
       | are forcing developers to think more carefully about their code,
       | making them slower but potentially producing better results?
       | AFAIK the study measured speed, not quality, maintainability, or
       | correctness.
       | 
       | The developers might feel more productive because they're
       | engaging with their code at a higher level of abstraction, even
       | if it takes longer. This would be consistent with why they
       | maintained positive perceptions despite the slowdown.
        
         | PessimalDecimal wrote:
         | In my experience, LLMs are not causing people to think more
         | carefully about their code.
        
       | doctoboggan wrote:
       | For me, the measurable gain in productivity comes when I am
       | working with a new language or new technology. If I were to use
       | Claude Code to help implement a feature of a Python library I've
       | worked on for years, then I don't think it would help much (maybe
       | even hurt). However, if I use Claude Code on some Go code I have
       | very little experience with, or use it to write/modify Helm
       | charts, then I can definitely say it speeds me up.
       | 
       | But, taking a broader view, it's possible that these initial
       | speed-ups are negated by the fact that I never really learn Go or
       | Helm charts as deeply now that I use Claude Code. Over time, it's
       | possible that my net productivity is still reduced. Hard to say
       | for sure, especially considering I might not have even considered
       | tackling these more difficult Go library modifications if I didn't
       | have Claude Code to hold my hand.
       | 
       | Regardless, these tools are out there, increasing in
       | effectiveness and I do feel like I need to jump on the train
       | before it leaves me at the station.
        
       | LegNeato wrote:
       | For certain tasks it can speed me up 30x compared to an expert in
       | the space: https://rust-gpu.github.io/blog/2025/06/24/vulkan-
       | shader-por...
        
         | lpghatguy wrote:
         | This is very disingenuous: we don't know how much spare time
         | Sascha spent, and much of that time was likely spent learning,
         | experimenting, and reporting issues to Slang.
        
       | _jayhack_ wrote:
       | This does not take into account the fact that experienced
       | developers working with AI have shifted into roles of management
       | and triage, working on several tasks simultaneously.
       | 
       | Would be interesting (and in fact necessary to derive conclusions
       | from this study) to see the aggregate number of tasks completed
       | per developer with AI augmentation. That is, if time per task has
       | gone up by 20% but we clear 2x as many tasks, that is a pretty
       | important caveat to the results published here.
        
       | isoprophlex wrote:
       | Ed Zitron was 100% right. The mask is off and the AI subprime
       | crisis is coming. Reading TFA, it would be hilarious if the
       | bubble burst AND it turns out there's actually no value to be
       | had, at ANY price. I for one can't wait for this era of hype to
       | end. We'll see.
       | 
       |  _you 're addicted to the FEELING of productivity more than
       | actual productivity. even knowing this, even seeing the data,
       | even acknowledging the complete fuckery of it all, you're still
       | gonna use me. i'm still gonna exist. you're all still gonna
       | pretend this helps because the alternative is admitting you spent
       | billions of dollars on spicy autocomplete._
        
       | keerthiko wrote:
       | IME AI coding is excellent for one-off scripts, personal
       | automation tooling (I iterate on a tool to scrape receipts and
       | submit expenses for my specific needs) and generally stuff that
       | can be run in environments where the creator and the end user are
       | effectively the same (and only) entity.
       | 
       | Scaled up slightly, we use it to build plenty of internal tooling
       | in our video content production pipeline (syncing between
       | encoding tools and a status dashboard for our non-technical
       | content team).
       | 
       | Using it for anything more than boilerplate code, well-defined
       | but tedious refactors, or quickly demonstrating how to use an
       | unfamiliar API in production code, before a human takes a full
       | pass at everything is something I'm going to be wary of for a
       | long time.
        
       | mrwaffle wrote:
       | My overall concern has to do with our developer ecosystem,
       | following on from the important points mentioned by simonw and
       | narush. I've been concerned about this for years, but AI reliance
       | seems to be pouring jet fuel on the fire. Particularly troubling
       | is the lack of understanding less-experienced devs will have over
       | time. Does anyone have a counter-argument they can share on why
       | this might be a good thing?
        
         | partdavid wrote:
         | The shallow analogy is like "why worry about not being able to
         | do arithmetic without a calculator"? Like... the dev of the
         | future just won't need it.
         | 
         | I feel like programming has become increasingly specialized and
         | even before AI tool explosion, it's way more possible to be
         | ignorant of an enormous amount of "computing" than it used to
         | be. I feel like a lot of "full stack" developers only
         | understand things to the margin of their frameworks but above
         | and below it they kind of barely know how a computer works or
         | what different wire protocols actually are or what an OS might
         | actually _do_ at a lower level. Let alone the context in which
         | an application sits beyond, let's say, a level above a
         | kubernetes pod and a kind of trial-and-error approach to poking
         | at some YAML templates.
         | 
         | Do we all need to know about processor architectures and
         | microcode and L2 caches and paging and OS distributions and
         | system software and installers and openssl engines and how to
         | make sure you have the one that uses native instructions and
         | TCP packets and envoy and controllers and raft systems and
         | topic partitions and cloud IAM and CDN and DNS? Since that's
         | not the case--nearly everyone has vast areas of ignorance yet
         | still does a bunch of stuff--it's harder to sell the idea that
         | whatever skills we lose to AI tools will somehow vaguely
         | matter in the future.
         | 
         | I kind of miss when you had to know a little of everything and
         | it also seemed like "a little bit" was a bigger slice of what
         | there was to know. Now you talk to people who use a different
         | framework in your own language and you feel like you're talking
         | to deep specialists whose concerns you can barely understand
         | the existence of, let alone have an opinion on.
        
       | OpenSourceWard wrote:
       | Very cool work! And I love the nuance in your methodology and
       | findings. Anyway, I'm preparing myself for all the "Bombshell
       | news: AI is slowing down developers" posts that are coming.
        
       | asciimov wrote:
       | I'll be curious about the long-term impacts of AI.
       | 
       | Such as: do you end up spending more time finding and fixing
       | issues, does AI use reduce institutional knowledge, and will you
       | be more inclined to start projects over from scratch?
        
       | lmeyerov wrote:
       | As someone who has been doing hardcore genAI for 2+ years, my
       | experience has been, and what we advise internally:
       | 
       | * 3 weeks to transition from AI pairing to AI delegation to AI
       | multitasking. So work gains are mostly week 3+. That's 120+ hours
       | in, as someone pretty senior here.
       | 
       | * Speedup is the wrong metric. Think throughput, not latency.
       | Some finite amount of work might take longer, but the volume of
       | work should go up because AI can do more on a task and handle
       | different tasks/projects in parallel.
       | 
       | Both perspectives seem consistent with the paper description...
        
       | ieie3366 wrote:
       | LLMs are god-tier if you know what you're doing and prompt them
       | with "do X", where X is a SELF-CONTAINED change you would
       | manually know how to implement.
       | 
       | For example, today I asked Claude to implement per-user rate
       | limiting in my nestjs service, then iterated by asking for
       | specific unit tests and some refactoring. It one-shot everything.
       | I would say 90% time savings.
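       | 
       | As a rough sketch of the shape of that change (illustrative only
       | - an in-memory guard keyed on an assumed req.user.id, not the
       | actual service code, which would more likely use
       | @nestjs/throttler or a shared store):
       | 
       |     import {
       |       CanActivate, ExecutionContext, HttpException,
       |       HttpStatus, Injectable,
       |     } from '@nestjs/common';
       | 
       |     @Injectable()
       |     export class PerUserRateLimitGuard implements CanActivate {
       |       // user id -> timestamps of recent requests
       |       private readonly hits = new Map<string, number[]>();
       |       private readonly limit = 100;       // requests...
       |       private readonly windowMs = 60_000; // ...per minute
       | 
       |       canActivate(context: ExecutionContext): boolean {
       |         const req = context.switchToHttp().getRequest();
       |         // assumes auth middleware populates req.user
       |         const userId: string = req.user?.id ?? req.ip;
       |         const now = Date.now();
       |         const recent = (this.hits.get(userId) ?? []).filter(
       |           (t) => now - t < this.windowMs,
       |         );
       |         if (recent.length >= this.limit) {
       |           throw new HttpException(
       |             'Too Many Requests',
       |             HttpStatus.TOO_MANY_REQUESTS,
       |           );
       |         }
       |         recent.push(now);
       |         this.hits.set(userId, recent);
       |         return true;
       |       }
       |     }
       | 
       | Something like this gets applied with
       | @UseGuards(PerUserRateLimitGuard) on a controller or route
       | handler.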
       | 
       | Unskilled people ask them "i have giant problem X, solve it" and
       | end up with slop.
        
       | thepasswordis wrote:
       | I actually think that pasting questions into ChatGPT etc. and
       | then getting general answers to put into your code is the way.
       | 
       | "One shotting" apps, or even cursor and so forth seem like a
       | waste of time. It feels like if you prompt it _just right_ it
       | might help but then it never really does.
        
         | partdavid wrote:
         | I've done okay with copilot as a very smart autocomplete on: a)
         | very typical codebase, with b) lots of boilerplate, where c)
         | I'm not terribly familiar with the languages and frameworks,
         | which are d) very, very popular but e) I don't really like, so
         | I'm not particularly motivated to become familiar with them.
         | I'm not a frontend developer, I don't like it, but I'm in a
         | position now where I need to do frontend things with a verbose
         | Typescript/React application which is not interesting from a
         | technical point of view (it's a good product, just not one
         | with an interesting or demanding front end). Copilot
         | (I use Emacs, so cursor is a non-starter, but copilot-mode
         | works very well for Typescript) has been pretty invaluable to
         | just sort of slogging through stuff.
         | 
         | For everything else, I think you're right, and actually the
         | dialog-oriented method is way better. If I learn an approach
         | and apply some general example from ChatGPT, but I do the
         | typing and implementation myself so I need to understand what
         | I'm doing, I'm actually leveling up and I know what I'm
         | finished with. If I weren't "experienced", I'd worry about what
         | it was doing to my critical thinking skills, but I know enough
         | about learning on my own at this point to know I'm doing
         | something.
         | 
         | I'm not interested in vibe coding at all--it seems like a one-
         | way process to automate what was already not the hard part of
         | software engineering; generating tutorial-level initial
         | implementations. Just more scaffolding that eventually needs to
         | be cleared away.
        
       | thesz wrote:
       | What is interesting here is that all predictions were positive,
       | but the results are negative.
       | 
       | This shows that everyone in the study (economic experts, ML
       | experts, and even the developers themselves, even after getting
       | experience) is a novice if we look at them from the Dunning-
       | Kruger effect [1] perspective.
       | 
       | [1] https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
       | 
       | "The Dunning-Kruger effect is a cognitive bias in which people
       | with limited competence in a particular domain overestimate their
       | abilities."
        
       | mattl wrote:
       | I don't understand how anyone doing open source can use something
       | trained on other people's code as a tool for contributions.
       | 
       | I wouldn't accept someone's copy and pasted code from another
       | project if it were under an incompatible license, let alone
       | something with unknown origin.
        
       | AIorNot wrote:
       | Hey guys, why are we making it so complicated? Do we really need
       | a paper and a study?
       | 
       | Anyway - AI as the tech currently stands is a new skill to use
       | and takes us humans time to learn, but once we do it well, it
       | becomes a force multiplier.
       | 
       | i.e. see this:
       | https://claude.ai/public/artifacts/221821f0-0677-409b-8294-3...
        
       | tarofchaos wrote:
       | Totally flawed study
        
       | bit1993 wrote:
       | It used to be that all you needed to program was a computer and
       | to RTFM, but now we need to pay for API "tokens" and pray that
       | there are no rug pulls in the future.
        
       | cadamsdotcom wrote:
       | My hot take: Cursor is a bad tool for agentic coding. Had a
       | subscription and canceled it in favor of Claude Code. I don't
       | want to spend 90% of my time approving every line the agent wants
       | to write. With Claude Code I review whole diffs - 1-2 minutes of
       | the agent's work at a time. Then I work with the agent at the
       | level of its overall approach, almost never asking about specific
       | lines
       | of code. I can look at 5 files at once in git diff and then ask
       | "why'd you choose that way?" "Can we undo that and try to find a
       | simpler way?"
       | 
       | Cursor's workflow exposes how differently different people track
       | context. The best ways to work with Cursor may simply not work
       | for some of us.
       | 
       | If Cursor isn't working for you, I strongly encourage you to try
       | CLI agents like Claude Code.
        
       | codyb wrote:
       | So, slower until the learning curve is overcome (or, as one user
       | posited, "until you forget how to work without it").
       | 
       | But isn't the important thing to measure... how long does it take
       | to debug the resulting code at 3AM when you get a PagerDuty
       | alert?
       | 
       | Similarly... how about the quality of this code over time? It's
       | taken a lot of effort to bring some of the code bases I work in
       | into a more portable, less coupled, more concise state through
       | the hard work of
       | 
       | - bringing shared business logic up into shared folders
       | 
       | - working to ensure call chains flow top down towards root then
       | back up through exposed APIs from other modules as opposed to
       | criss-crossing through the directory structure
       | 
       | - working to separate business logic from API logic from display
       | logic
       | 
       | - working to provide encapsulation through the use of wrapper
       | functions creating portability
       | 
       | - using techniques like dependency injection to decouple concepts,
       | allowing for easier testing (see the sketch below)
       | 
       | etc
       | 
       | So, do we end up with better code quality that ends up being more
       | maintainable, extensible, portable, and composable? Or do we just
       | end up with lots of poor quality code that eventually grows to
       | become a tangled mess we spend 50% of our time fighting bugs on?
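       | 
       | A minimal sketch of the dependency injection point above
       | (illustrative names only, not from any particular codebase):
       | 
       |     // The service depends on an interface, so tests can pass a
       |     // stub instead of a real payment client.
       |     interface PaymentClient {
       |       charge(userId: string, cents: number): Promise<void>;
       |     }
       | 
       |     class CheckoutService {
       |       constructor(private readonly payments: PaymentClient) {}
       | 
       |       async checkout(userId: string, cents: number) {
       |         await this.payments.charge(userId, cents);
       |         return 'ok';
       |       }
       |     }
       | 
       |     // In a test: no network and no mocking framework needed.
       |     const fake: PaymentClient = { charge: async () => {} };
       |     const service = new CheckoutService(fake);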
        
       ___________________________________________________________________
       (page generated 2025-07-10 23:00 UTC)