[HN Gopher] I read all of Cloudflare's Claude-generated commits
___________________________________________________________________
I read all of Cloudflare's Claude-generated commits
Author : maxemitchell
Score : 201 points
Date : 2025-06-06 22:35 UTC (1 days ago)
(HTM) web link (www.maxemitchell.com)
(TXT) w3m dump (www.maxemitchell.com)
| SupremumLimit wrote:
| It's an interesting review but I really dislike this type of
| techno-utopian determinism: "When models inevitably improve..."
| Says who? How is it inevitable? What if they've actually reached
| their limits by now?
| Dylan16807 wrote:
| Models are improving every day. People are figuring out
| thousands of different optimizations to training and to
| hardware efficiency. The idea that right now in early June 2025
| is when improvement _stops_ beggars belief. We might be
| approaching a limit, but that's going to be a sigmoid curve,
| not a sudden halt in advancement.
| deadbabe wrote:
| 5 years ago a person would be blown away by today's LLMs. But
| people today will merely say "cool" at whatever LLMs are in
| use 5 years from now. Or maybe not even that.
| dingnuts wrote:
| 5 years ago GPT-2 was already outputting largely coherent
| speech; there's been progress, but it's not all that
| shocking.
| tptacek wrote:
| For most of the developers I know personally who have been
| radicalized by coding agents, it happened within the past 9
| months. It does not feel like we are in a phase of
| predictable, boring improvement.
| keybored wrote:
| Radicalized? Going with the flow and wishes of the people
| who are driving AI is the opposite of that.
|
| To have their minds changed drastically, sure..
| tptacek wrote:
| Sorry I have no idea what you're trying to say here.
| lcnPylGDnU4H9OF wrote:
| > very different from the usual or traditional
|
| https://www.merriam-webster.com/dictionary/radical
|
| Going from deciding that AI is going nowhere to suddenly
| deciding that coding agents are how they will work going
| forward is a radical change. That is what they meant.
| keybored wrote:
| Did you miss my second paragraph?
|
| https://www.merriam-webster.com/dictionary/paragraph
| Dylan16807 wrote:
| Can you explain exactly what you meant by your second
| paragraph? The ambiguity is why you got that reply.
|
| If your second paragraph makes that reply irrelevant, are
| you saying the meaning was "Your use of 'radicalized' is
| technically correct but I still think you shouldn't have
| used it here"?
| dwaltrip wrote:
| Bold prediction...
| sitkack wrote:
| It is copium that it will suddenly stop and the world they
| knew before will return.
|
| ChatGPT came out in Nov 2022. Attention Was All There Was in
| 2017, we were already 5 years in the past. Or 5 years of
| research to catch up to, and then from 2022 to now ... papers
| and research have been increasing exponentially. Even if SOTA
| models were frozen, we still have years of research to apply
| and optimize in various ways.
| BoorishBears wrote:
| I think it's equally copium that people keep assuming we're
| just going to compound our way into intelligence that
| generalizes enough to stop us from handholding the AI, as
| much as I'd _genuinely_ enjoy that future.
|
| Lately I spend all day post-training models for my product,
| and I want to say 99% of the research specific to LLMs
| doesn't reproduce and/or matter once you actually dig in.
|
| We're getting exponentially more papers on the topics and
| they're getting worse on average.
|
| Every day there's a new paper claiming an X% gain by post-
| training some ancient 8B parameter model and comparing it
| to a bunch of other ancient models after they've overfitted
| on the public dataset of a given benchmark and given the
| model a best of 5.
|
| And benchmarks won't ever show it, but even ChatGPT
| 3.5-Turbo has better general world knowledge than a lot of
| models people consider "frontier" models today, because
| post-training makes it easy to cover up those gaps with
| very impressive one-prompt outputs and strong benchmark
| scores.
|
| -
|
| It feels like things are getting stuck in a local maximum:
| we _are_ making forward progress, the models _are_ useful
| and getting more useful, but the future people are
| envisioning takes reaching a completely different goal post
| that I'm not at all convinced we're making exponential
| progress towards.
|
| There may be an exponential number of techniques claiming to
| be groundbreaking, but what has actually unlocked new
| capabilities that can't just as easily be attributed to how
| much more focused post-training has become on coding and
| math?
|
| Test time compute feels like the only one and we're already
| seeing the cracks form in terms of its effect on
| hallucinations, and there's a clear ceiling for the
| performance the current iteration unlocks as all these
| models are converging on pretty similar performance after
| just a few model releases.
| rxtexit wrote:
| The copium is, I think, that many people got comfortable
| post-financial-crisis with nothing much changing or happening.
| I think many people really liked a decade-long stretch with
| not much more than web framework updates and smartphone
| versioning.
|
| We are just back on track.
|
| I just read Oracular Programming: A Modular Foundation for
| Building LLM-Enabled Software the other day.
|
| We don't even have a new paradigm yet. I would be shocked
| if in 10 years I don't look back at this time of writing
| a prompt into a chatbot and then pasting the code into an
| IDE as completely comical.
|
| The most shocking thing to me is we are right back on track
| to what I would have expected in 2000 for 2025. In 2019
| those expectations seemed like science fiction delusions
| after nothing happening for so long.
| sitkack wrote:
| Reading the Oracular paper now,
| https://news.ycombinator.com/edit?id=44211588
|
| It feels a bit like Halide, where the goal and the
| strategy are separated so that each can be optimized
| independently.
|
| Those new paradigms are being discovered by hordes of
| vibecoders, myself included. I am having wonderful
| results with TDD and AI assisted design.
|
| IDEs are now mostly browsers for code, and I no longer
| copy and paste with a chatbot.
|
| Curious what you think about the Oracular paper. One area
| that I have been working on for the last couple weeks is
| extracting ToT for the domain and then using the LLM to
| generate an ensemble of exploration strategies over that
| tree.
| a2128 wrote:
| I think at this point we're reaching more incremental
| updates, which can score higher on some benchmarks but then
| simultaneously behave worse with real-world prompts, most
| especially if they were prompt engineered for a specific
| model. I recall Google updating their Flash model on their
| API with no way to revert to the old one and it caused a lot
| of people to complain that everything they've built is no
| longer working because the model is just behaving differently
| than when they wrote all the prompts.
| whbrown wrote:
| Isn't it quite possible they replaced that Flash model with
| a distilled version, saving money rather than increasing
| quality? This just speaks to the value of open-weights more
| than anything.
| Sevii wrote:
| Models have improved significantly over the last 3 months. Yet
| people have been saying 'What if they've actually reached their
| limits by now?' for pushing 3 years.
| greyadept wrote:
| For me, improvement means no hallucination, but that only
| seems to have gotten worse and I'm interested to find out
| whether it's actually solvable at all.
| dymk wrote:
| All the benchmarks would disagree with you
| thuuuomas wrote:
| Today's public benchmarks are yesterday's training data.
| BoorishBears wrote:
| The benchmarks also claim random 32B parameter models
| beat Claude 4 at coding, so we know just how much they
| matter.
|
| It should be obvious to anyone with a cursory interest in
| model training that you can't trust benchmarks unless
| they're fully private black-boxes.
|
| If you can get even a _hint_ of the shape of the
| questions on a benchmark, it's trivial to synthesize
| massive amounts of data that help you beat the benchmark.
| And given the nature of funding right now, you're almost
| silly _not_ to do it: it's not cheating, it's
| "demonstrably improving your performance at the
| downstream task"
| tptacek wrote:
| Why do you care about hallucination for coding problems?
| You're in an agent loop; the compiler is ground truth. If
| the LLM hallucinates, the agent just iterates. You don't
| even see it unless you make the mistake of looking closely.
| kiitos wrote:
| What on earth are you talking about??
|
| If the LLM hallucinates, then the code it produces is
| wrong. That wrong code isn't obviously or
| programmatically determinable as wrong, the agent has no
| way to figure out that it's wrong, it's not as if the LLM
| produces at the same time tests that identify that
| hallucinated code as being wrong. The only way that this
| wrong code can be identified as wrong is by the human
| user "looking closely" and figuring out that it is wrong.
|
| You seem to have this fundamental belief that the code
| that's produced by your LLM is valid and doesn't need to
| be evaluated, line-by-line, by a human, before it can be
| committed?? I have no idea how you came to this belief
| but it certainly doesn't match my experience.
| tptacek wrote:
| No, what's happening here is we're talking past each
| other.
|
| An agent _lints and compiles code_. The LLM is stochastic
| and unreliable. The agent is ~200 lines of Python code
| that checks the exit code of the compiler and relays it
| back to the LLM. You can easily fool an LLM. You can't fool
| the compiler.
|
| I didn't say anything about whether code needs to be
| reviewed line-by-line by humans. I review LLM code line-
| by-line. Lots of code that compiles clean is nonetheless
| horrible. But none of it includes hallucinated API calls.
|
| Also, where did this "you seem to have a fundamental belief"
| stuff come from? You had like 35 words to go on.
| someothherguyy wrote:
| Linting isn't verification of correctness, and yes, you
| can fool the compiler, linters, etc. Work with some human
| interns, they are great at it. Agents will do crazy
| things to get around linting errors, including removing
| functionality.
| fragmede wrote:
| have you no tests?
| kiitos wrote:
| Irrelevant, really. Tests establish a minimum threshold
| of acceptability, they don't (and can't) guarantee
| anything like overall correctness.
| tptacek wrote:
| Just checking off the list of things you've determined to
| be irrelevant. Compiler? Nope. Linter? Nope. Test suite?
| Nope. How about TLA+ specifications?
| skydhash wrote:
| TLA+ specs don't verify code. They verify design. Such
| design can be expressed in whatever, including pseudocode
| (think algorithms notation in textbooks). Then you write
| the TLA specs that will judge if invariants are truly
| respected. Once you're sure of the design, you can go and
| implement it, but there are no hard constraints like a type
| system.
| tptacek wrote:
| At what level of formal methods verification does the
| argument against AI-generated code fall apart? My
| expectation is that the answer is "never".
|
| The subtext is pretty obvious, I think: that standards,
| on message boards, are being set for LLM-generated code
| that are ludicrously higher than would be set for people-
| generated code.
| kiitos wrote:
| I truly don't know what you're trying to communicate,
| with all of your recent comments related to AI and LLM
| and codegen and etc., the only thing I can guess is that
| you're just cynically throwing sand into the wind. It's
| unfortunate, your username used to carry some clout and
| respect.
| kiitos wrote:
| > If the LLM hallucinates, then the code it produces is
| wrong. That wrong code isn't obviously or
| programmatically determinable as wrong, the agent has no
| way to figure out that it's wrong, it's not as if the LLM
| produces at the same time tests that identify that
| hallucinated code as being wrong. The only way that this
| wrong code can be identified as wrong is by the human
| user "looking closely" and figuring out that it is wrong
|
| The LLM can easily hallucinate code that will satisfy the
| agent and the compiler but will still fail the actual
| intent of the user.
|
| > I review LLM code line-by-line. Lots of code that
| compiles clean is nonetheless horrible.
|
| Indeed _most_ code that LLMs generate compiles clean and
| is nevertheless horrible! I'm happy that you recognize
| this truth, but the fact that you review that LLM-
| generated code line-by-line makes you an extraordinary
| exception vs. the normal user, who generates LLM code and
| absolutely does not review it line-by-line.
|
| > But none of [the LLM generated code] includes
| hallucinated API calls.
|
| Hallucinated API calls are just one of many many possible
| kinds of hallucinated code that an LLM can generate, by
| no means does "hallucinated code" describe only
| "hallucinated API calls" -- !
| tptacek wrote:
| When you say "the LLM can easily hallucinate code that
| will satisfy the compiler but still fail the actual
| intent of the user", all you are saying is that the code
| will have bugs. My code has bugs. So does yours. You
| don't get to use the fancy word "hallucination" for
| reasonable-looking, readable code that compiles and lints
| but has bugs.
|
| I think at this point our respective points have been
| made, and we can wrap it up here.
| someothherguyy wrote:
| Hallucination is a fancy word?
|
| The parent seems to be, in part, referring to "reward
| hacking", which tends to be used as a super category to
| what many refer to as slop, hallucination, cheating, and
| so on.
|
| https://courses.physics.illinois.edu/ece448/sp2025/slides
| /le...
| kiitos wrote:
| > When you say "the LLM can easily hallucinate code that
| will satisfy the compiler but still fail the actual
| intent of the user", all you are saying is that the code
| will have bugs. My code has bugs. So does yours. You
| don't get to use the fancy word "hallucination" for
| reasonable-looking, readable code that compiles and lints
| but has bugs.
|
| There is an obvious and categorical difference between
| the "bugs" that an LLM produces as part of its generated
| code, and the "bugs" that I produce as part of the code
| that I write. You don't get to conflate these two classes
| of bugs as though they are equivalent, or even
| comparable. They aren't.
| tptacek wrote:
| They obviously are.
| simonw wrote:
| You seem to be using "hallucinate" to mean "makes
| mistakes".
|
| That's not how I use it. I see hallucination as a very
| specific kind of mistake: one where the LLM outputs
| something that is entirely fabricated, like a class
| method that doesn't exist.
|
| The agent compiler/linter loop can entirely eradicate
| those. That doesn't mean the LLM won't make plenty of
| other mistakes that don't qualify as hallucinations by
| the definition I use!
|
| It's newts and salamanders. Every newt is a salamander,
| not every salamander is a newt. Every hallucination is a
| mistake, not every mistake is a hallucination.
|
| https://simonwillison.net/2025/Mar/2/hallucinations-in-
| code/
| kiitos wrote:
| I'm not using "hallucinate" to mean "makes mistakes". I'm
| using it to mean "code that is syntactically correct and
| passes tests but is semantically incoherent". Which is
| the same thing that "hallucination" normally means in the
| context of a typical user LLM chat session.
| saagarjha wrote:
| My guy didn't you spend like half your life in the field
| where your job was to sift through code that compiled but
| nonetheless had bugs that you tried to exploit? How can
| you possibly have this belief about AI generated code?
| tptacek wrote:
| I don't understand this question. Yes, I spent about 20
| years learning the lesson that code is profoundly
| knowable; to start with, you just read it. What challenge
| do you believe AI-generated code presents to me?
| lcnPylGDnU4H9OF wrote:
| > You seem to have this fundamental belief that the code
| that's produced by your LLM is valid and doesn't need to
| be evaluated, line-by-line, by a human, before it can be
| committed??
|
| This is a mistaken understanding. The person you
| responded to has written on these thoughts already and
| they used memorable words in response to this proposal:
|
| > Are you a vibe coding Youtuber? Can you not read code?
| If so: astute point. Otherwise: what the fuck is wrong
| with you?
|
| It should be obvious that one would read and verify the
| code before they commit it. Especially if one works on a
| team.
|
| https://fly.io/blog/youre-all-nuts/
| kasey_junk wrote:
| We should go one step past this and come up with an
| industry practice where we get someone other than the
| author to read the code before we merge it.
| BoorishBears wrote:
| This is just people talking past each other.
|
| If you want a model that's getting better at helping you as a
| tool (which for the record, I do), then you'd say in the last
| 3 months things got better between Gemini's long context
| performance, the return of Claude Opus, etc.
|
| But if your goal post is replacing SWEs entirely... then it's
| not hard to argue we definitely didn't overcome any new
| foundational issues in the last 3 months, and not too many
| were solved in the last 3 years even.
|
| In the last year the only real _foundational_ breakthrough
| would be RL-based reasoning w/ test-time compute delivering
| real results, but what that does to hallucinations, plus even
| Deepseek catching up with just a few months of post-training,
| shows that in its current form the technique doesn't blow past
| the barriers the way people were originally touting it would.
|
| Overall models are getting better at things we can trivially
| post-train and synthesize examples for, but it doesn't feel
| like we're breaking unsolved problems at a substantially
| accelerated rate (yet.)
| atomlib wrote:
| https://xkcd.com/605/
| groby_b wrote:
| It is "inevitable" in the sense that in 99% of the cases,
| tomorrow is just like yesterday.
|
| LLMs have been continually improving for years now. The
| surprising thing would be them not improving further. And if
| you follow the research even remotely, you know they'll improve
| for a while, because not all of the breakthroughs have landed
| in commercial models yet.
|
| It's not "techno-utopian determinism". It's a clearly visible
| trajectory.
|
| Meanwhile, if they didn't improve, it wouldn't make a
| significant change to the overall observations. It's picking a
| minor nit.
|
| The observation that strict prompt adherence plus prompt
| archival could shift how we program is both true and a
| phenomenon we've observed several times in the past. Nobody
| keeps the assembly output from the compiler around anymore,
| either.
|
| There's definitely valid criticism of the passage, and it's
| overly optimistic - in that most non-trivial prompts are still
| underspecified and have multiple possible implementations, not
| all correct. That's both a more useful criticism, and not tied
| to LLM improvements at all.
| double0jimb0 wrote:
| Are there places that follow the research that speak to the
| layperson?
| sumedh wrote:
| More compute means faster processing and more context.
| its-kostya wrote:
| What is ironic: if we buy into the theory that AI will write
| the majority of the code in the next 5-10 years, what is it
| going to train on after? ITSELF? It seems this theoretical
| trajectory of "will inevitably get better" is only true if
| humans are producing quality training data. The quality of the
| code LLMs create is very much proportional to how mature and
| ubiquitous the languages/projects are.
| solarwindy wrote:
| I think you neatly summarise why the current pre-trained LLM
| paradigm is a dead end. If these models were really capable
| of artificial reasoning and _learning_, they wouldn't need
| more training data at all. If they could learn like a human
| junior does, and actually progress to being a senior, then I
| really could believe that we'll all be out of a job--but they
| just _do not_.
| SrslyJosh wrote:
| > Reading through these commits sparked an idea: what if we
| treated prompts as the actual source code? Imagine version
| control systems where you commit the prompts used to generate
| features rather than the resulting implementation.
|
| Please god, no, never do this. For one thing, why would you _not_
| commit the generated source code when storage is essentially
| free? That seems insane for multiple reasons.
|
| > When models inevitably improve, you could connect the latest
| version and regenerate the entire codebase with enhanced
| capability.
|
| How would you know if the code was better or worse if it was
| never committed? How do you audit for security vulnerabilities or
| debug with no source code?
| Sevii wrote:
| There are lots of reasons not to do it. But if LLMs get good
| enough that it works consistently people will do it anyway.
| minimaxir wrote:
| What will people call it when coders rely on vibes even more
| than vibe coding?
| roywiggins wrote:
| Haruspicy?
| brookst wrote:
| Writing specs
| auggierose wrote:
| Exactly my thought. This is just natural language as a
| specification language.
| kiitos wrote:
| ...as an ambiguous and inadequately-specified
| specification language.
| rectang wrote:
| >> _what if we treated prompts as the actual source code?_
|
| You would not do this because: unlike programming languages,
| natural languages are ambiguous and thus inadequate to fully
| specify software.
| a012 wrote:
| Prompts are like stories on the board, and just as with
| engineers, the generated source code can vary depending on
| the model's understanding. Saying the prompts could be the
| actual code is such a wrong and dangerous thought.
| squillion wrote:
| Exactly!
|
| > this assumes models can achieve strict prompt adherence
|
| What does strict adherence to an ambiguous prompt even mean?
| It's like those people asking Babbage if his machine would
| give the right answer when given the wrong figures. _I am not
| able rightly to apprehend the kind of confusion of ideas that
| could provoke such a proposition._
| tayo42 wrote:
| I'm pretty sure most people aren't doing "software engineering"
| when they program. There's the whole world of WordPress- and
| Dreamweaver-like programming out there too, where the
| consequences of messing up aren't really important.
|
| LLMs can be configured to have deterministic output too.
| fastball wrote:
| The idea as stated is a poor one, but a slight reshuffling and
| it seems promising:
|
| You generate code with LLMs. You write tests for this code,
| either using LLMs or on your own. You of course commit your
| actual code: it is required to actually run the program, after
| all. However you also save the entire prompt chain somewhere.
| Then (as stated in the article), when a much better model comes
| along, you re-run that chain, presumably with prompting like
| "create this project, focusing on efficiency" or "create this
| project in Rust" or "create this project, focusing on
| readability of the code". Then you run the tests against the
| new codebase and if the suite passes you carry on, with a much
| improved codebase. The theoretical benefit of this over just
| giving your previously generated code to the LLM and saying
| "improve the readability" is that the newer (better) LLM is not
| burdened by the context of the "worse" decisions made by the
| previous LLM.
|
| Obviously it's not actually that simple, as tests don't catch
| everything (tho with fuzz testing and complete coverage and
| such they can catch most issues), but we programmers often
| treat them as if they do, so it might still be a worthwhile
| endeavor.
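|
| (A toy sketch of that flow, assuming the prompt chain was saved
| as JSON; llm_generate and write_generated_files are hypothetical
| stand-ins for the model call and codegen tooling:)
|
|     import json
|     import subprocess
|
|     def llm_generate(model: str, prompt: str) -> str:
|         """Hypothetical: call the newer model of your choice."""
|         raise NotImplementedError
|
|     def write_generated_files(output: str) -> None:
|         """Hypothetical: materialize the model's output in the repo."""
|         raise NotImplementedError
|
|     def regenerate_and_check(chain_path: str, model: str) -> bool:
|         # Replay the archived prompt chain against the newer model.
|         with open(chain_path) as f:
|             for prompt in json.load(f):
|                 write_generated_files(llm_generate(model, prompt))
|         # Keep the regenerated codebase only if the old suite passes.
|         return subprocess.run(["pytest", "-q"]).returncode == 0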
| stingraycharles wrote:
| Means the temperature should be set to 0 (which not every
| provider supports) so that the output becomes entirely
| deterministic. Right now with most models if you give the
| same input prompt twice it will give two different solutions.
| NitpickLawyer wrote:
| Even at temp 0, you might get different answers, depending
| on your inference engine. There might be hardware
| differences, as well as software issues (e.g. vLLM
| documents this, if you're using batching, you might get
| different answers depending on where in the batch sequence
| your query landed).
| derwiki wrote:
| Two years ago when I was working on this at a startup,
| setting OAI models' temp to 0 still didn't make them
| deterministic. Has that changed?
| fastball wrote:
| I would only care about more deterministic output if I was
| repeating the same process with the same model, which is
| not the point of the exercise.
| weird-eye-issue wrote:
| Claude Code already uses a temperature of 0 (just inspect
| the requests) but it's not deterministic
|
| Not to mention it also performs web searches, web fetching
| etc which would also make it not deterministic
| afiori wrote:
| Do LLM inference engines have a way to seed their
| randomness, so as to have reproducible outputs with still
| some variance if desired?
| bavell wrote:
| Yes, although it's not always exposed to the end user of
| LLM providers.
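|
| Where it is exposed, it looks roughly like this (OpenAI-style
| chat API as one example; even with temperature=0 and a seed this
| is only best-effort, as the sibling comments note):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     resp = client.chat.completions.create(
|         model="gpt-4o",
|         messages=[{"role": "user", "content": "Write a bubble sort"}],
|         temperature=0,  # greedy-ish sampling
|         seed=42,        # best-effort reproducibility, not a guarantee
|     )
|     # system_fingerprint changes when the backend config changes,
|     # which is one reason identical requests can still diverge.
|     print(resp.system_fingerprint, resp.choices[0].message.content)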
| singhrac wrote:
| Production inference is not deterministic because of
| sharding (i.e. parameter weights on several GPUs on the
| same machine or MoE), timing-based kernel choices (e.g.
| torch.backends.cudnn.benchmark), or batched routing in
| MoEs. Probably best to host a small model yourself.
| maxemitchell wrote:
| Your rephrasing better captures my idea, and I should have
| emphasized in the post that I do _not_ think this is a good
| idea (nor possible) right now; it was more of a hand-wavy
| "how could we rethink source control in a post-LLM world"
| passing thought I had while reading through all the commits.
|
| Clearly it struck a chord with a lot of the folks here
| though, and it's awesome to read the discourse.
| renewiltord wrote:
| It's been a thing people have done for at least a year
| https://github.com/i365dev/LetterDrop
| gizmo686 wrote:
| My work has involved a project that is almost entirely
| generated code for over a decade. Not AI generated, the actual
| work of the project is in creating the code generator.
|
| One of the things we learned very quickly was that having
| generated source code in the same repository as actual source
| code was not sustainable. The nature of reviewing changes is
| just too different between them.
|
| Another thing we learned very quickly was that attempting to
| generate code, then modify the result is not sustainable; nor
| is aiming for a 100% generated code base. The end result of
| that was that we had to significantly rearchitect the project
| for us to essentially inject manually crafted code into
| arbitrary places in the generated code.
|
| Another thing we learned is that any change in the code
| generator needs to have a feature flag, because _someone_ was
| relying on the old behavior.
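|
| (For flavor, one common shape that injection can take; a toy
| Python sketch with hypothetical names, not our actual generator:
| the generated code looks up optional hand-written hooks and
| calls them at fixed extension points.)
|
|     # generated_user.py -- emitted by the generator, never hand-edited
|     import importlib
|
|     try:
|         _hooks = importlib.import_module("user_hooks")  # hand-written, optional
|     except ImportError:
|         _hooks = None
|
|     def save_user(user: dict) -> dict:
|         # Extension point: let hand-written code adjust the record first.
|         if _hooks and hasattr(_hooks, "before_save_user"):
|             user = _hooks.before_save_user(user)
|         # ... generated persistence logic would go here ...
|         return user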
| mschild wrote:
| > One of the things we learned very quickly was that having
| generated source code in the same repository as actual source
| code was not sustainable.
|
| Keeping a repository with the prompts, or other commands
| separate is fine, but not committing the generated code at
| all I find questionable at best.
| djtango wrote:
| I didn't read it as that - If I understood correctly,
| generated code must be quarantined very tightly. And
| inevitably you need to edit/override generated code and the
| manner by which you alter it must go through some kind of
| process so the alteration is auditable and can again be
| clearly distinguished from generated code.
|
| Tbh this all sounds very familiar and like classic data
| management/admin systems for regular businesses. The only
| difference is that the data is code and the admins are the
| engineers themselves so the temptation to "just" change
| things in place is too great. But I suspect it doesn't
| scale and is hard to manage etc.
| diggan wrote:
| If you can 100% reproduce the same generated code from the
| same prompts, even 5 years later, given the same versions
| and everything, then I'd say "Sure, go ahead and don't save
| the generated code, we can always regenerate it". As
| someone who spent some time in frontend development, we've
| been doing it like that for a long time with (MB+)
| generated code, keeping it in scm just isn't feasible long-
| term.
|
| But given this is about LLMs, which people tend to run with
| temperature>0, this is unlikely to be true, so then I'd
| really urge anyone to actually store the results
| (somewhere, maybe not in scm specifically) as otherwise you
| won't have any idea about what the code was in the future.
| overfeed wrote:
| > If you can 100% reproduce the same generated code from
| the same prompts, even 5 years later
|
| Reproducible builds with deterministic stacks and local
| compilers are far from solved. Throwing in LLM randomness
| just makes for a spicier environment to not commit the
| generated code.
| saagarjha wrote:
| I feel like using a compiler is in a sense a code generator
| where you don't commit the actual output
| mschild wrote:
| Sure, but compilers are arguably idempotent. Same code
| input, same output. LLMs certainly are not.
| saagarjha wrote:
| Yeah, I fully agree (in the other comments here, no less);
| I just think "I don't commit my code" reflects a specific
| mindset about what code actually is.
| lelanthran wrote:
| > I feel like using a compiler is in a sense a code
| generator where you don't commit the actual output
|
| Compilers are deterministic. Given the same input you
| always get the same output so there's no reason to store
| the output. If you don't get the same output we call it a
| compiler bug!
|
| LLMs do not work this way.
|
| (Aside: Am I the only one who feels that the entire AI
| industry is predicated on replacing only development
| positions? We're looking at, what, $100bn invested, with
| almost no reduction in customers' operating costs except
| where the customer has developers.)
| Atotalnoob wrote:
| LLMs CAN be deterministic. You can control the
| temperature to get the same output repeatedly.
|
| Although I don't really understand why you'd only want to
| store prompts...
|
| What if that model is no longer available?
| saagarjha wrote:
| They're typically not, since they typically rely on
| operators that aren't deterministic (e.g. atomics).
| cesarb wrote:
| > Compilers are deterministic. Given the same input you
| always get the same output
|
| Except when they aren't. See for instance
| https://gcc.gnu.org/onlinedocs/gcc-15.1.0/gcc/Developer-
| Opti... or the __DATE__/__TIME__ macros.
| lelanthran wrote:
| From the link:
|
| > You can use the -frandom-seed option to produce
| reproducibly identical object files.
|
| Deterministic.
|
| Also, with regard to __DATE__/__TIME__ macros, those are
| deterministic, because the current date and time are part
| of the inputs.
| tptacek wrote:
| Why does it matter to you if the code generator is
| deterministic? The _code_ is.
|
| If LLM generation was like a Makefile step, part of your
| build process, this concern would make a lot of sense.
| But nobody, anywhere, does that.
| cimi_ wrote:
| I will guess that you are generating orders of magnitude more
| lines of code with your software than people do when building
| projects with LLMs - if this is true I don't think the
| analogy holds.
| saagarjha wrote:
| I think the biggest difference here is that your code
| generator is probably deterministic and you likely are able
| to debug the results it produces rather than treating it like
| a black box.
| buu700 wrote:
| Overloading of the term "generate" is probably creating
| some confused ideas here. An LLM/agent is a lot more
| similar to a human in terms of its transformation of input
| into output than it is to a compiler or code generator.
|
| I've been working on a recent project with heavy use of AI
| (probably around 100 hours of long-running autonomous AI
| sprints over the last few weeks), and if you tried to re-
| run all of my prompts in order, even using the exact same
| models with the exact same tooling, it would almost
| certainly fall apart pretty quickly. After the first few, a
| huge portion of the remaining prompts would be referencing
| code that wouldn't exist and/or responding to things that
| wouldn't have been said in the AI's responses. Meta-
| prompting (prompting agents to prepare prompts for other
| agents) would be an interesting challenge to properly
| encode. And how would human code changes be represented, as
| patches against code that also wouldn't exist?
|
| The whole idea also ignores that AI being fast and cheap
| compared to human developers doesn't make it infinitely
| fast or free, or put it in the same league of quickness and
| cheapness as a compiler. Even if this were conceptually
| feasible, all it would really accomplish is making it so
| that any new release of a major software project takes
| weeks (or more) of build time and thousands of dollars (or
| more) burned on compute.
|
| It's an interesting thought experiment, but the way I would
| put it into practice would be to use tooling that includes
| all relevant prompts / chat logs in each commit message.
| Then maybe in the future an agent with a more advanced
| model could go through each commit in the history one by
| one, take notes on how each change could have been better
| implemented based on the associated commit message and any
| source prompts contained therein, use those notes to inform
| a consolidated set of recommended changes to the current
| code, and then actually apply the recommendations in a
| series of pull requests.
| tptacek wrote:
| People keep saying this and it doesn't make sense. I review
| code. I don't construct a theory of mind of the author of
| the code. With AI-generated code, if it isn't eminently
| reviewable, I reflexively kill the PR and either try again
| or change the tasking.
|
| There's always this vibe that, like, AI code is like an
| IOCCC puzzle. No. It's extremely boring mid-code. Any
| competent developer can review it.
| buu700 wrote:
| I assumed they were describing AI itself as a black box
| (contrasting it with deterministic code generation), not
| the output of AI.
| tptacek wrote:
| Right, I get that, and an LLM call by itself clearly is a
| black box. I just don't get why that's supposed to
| matter. It produces an artifact I can (and must) verify
| myself.
| buu700 wrote:
| Because if the LLM is a black box and its output must
| ultimately be verified by humans, then you can't treat
| conversion of prompts into code as a simple build step as
| though an AI agent were just some sort of compiler. You
| still need to persist the actual code in source control.
| skywhopper wrote:
| There's a huge difference between deterministic generated
| code and LLM generated code. The latter will be different
| every time, sometimes significantly so. Subsequent prompts
| would almost immediately be useless. "You did X, but we want
| Y" would just blow up if the next time through the LLM (or
| the new model you're trying) doesn't produce X at all.
| overfeed wrote:
| > One of the things we learned very quickly was that having
| generated source code in the same repository as actual source
| code was not sustainable
|
| My rule of thumb is to have both in the same repo, but treat
| generated code like binary data. This was informed by a time I
| was burned by a tooling regression that broke the generated
| code, and the investigation was complicated by having to
| correlate commits across different repositories.
| dkubb wrote:
| I love having generated code in the same repo as the
| generator because with every commit I can regenerate the
| code and compare it to make sure it stays in sync. Then it
| forms something similar to a golden tests where if
| something unexpected changes it gets noticed on review.
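|
| (A sketch of that kind of check as a CI step; generate.py and
| the generated/ directory are hypothetical stand-ins for the real
| generator and its committed output:)
|
|     # check_generated.py: fail CI if the committed output drifts
|     import subprocess
|     import sys
|     import tempfile
|
|     with tempfile.TemporaryDirectory() as tmp:
|         # Regenerate into a scratch directory.
|         subprocess.run(["python", "generate.py", "--out", tmp], check=True)
|         # Non-zero exit from diff -r means the committed tree drifted.
|         drifted = subprocess.run(["diff", "-r", "generated/", tmp]).returncode
|         sys.exit(1 if drifted else 0)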
| david-gpu wrote:
| Please tell us which company you are working for so that we
| don't send our resumes there.
|
| Jokes aside, I have worked on projects where auto-generating
| code was the chosen solution, and it's always been
| 100% auto-generated, essentially at compilation time. Any
| hand-coded stuff needed to handle corner cases or glue pieces
| together was kept outside of the code generator.
| mellosouls wrote:
| Yes, it's too early to be doing that now, but if you see the
| move to AI-assisted code as _at least_ the same magnitude of
| change as the move from assembly to high level languages, the
| argument makes more sense.
|
| Nobody commits the compiled code; this is the direction we are
| moving in: high-level source code is the new assembly.
| Xelbair wrote:
| Worse: models aren't deterministic! They use a temperature
| value to control randomness, just so they can escape local
| minima!
|
| Regenerated code might behave differently, have different
| bugs (worst case), or not work at all (best case).
| chrishare wrote:
| Nitpick - it's the ML system that is sampling from model
| predictions that has a temperature parameter, not the model
| itself. Temperature and even model aside, there are other
| sources of randomness like the underlying hardware that can
| cause the havoc you describe.
| never_inline wrote:
| Apart from obvious non-reproducibility, the other problem is
| lack of navigable structure. I can't command+click or "show
| usages" or "show definition" any more.
| saagarjha wrote:
| Just ask the AI for those obviously
| visarga wrote:
| The idea is good, but we should commit both documentation and
| tests. They allow regenerating the code at will.
| pollinations wrote:
| I'd say commit a comprehensive testing system with the prompts.
|
| Prompts are in a sense what higher-level programming languages
| were to assembly. Sure, there is a crucial difference, which is
| reproducibility. I could try and write down my thoughts on why
| I think in the long run it won't be so problematic. I could be
| wrong, of course.
|
| I run https://pollinations.ai which serves over 4 million
| monthly active users quite reliably. It is mostly coded with
| AI. For about a year there has been no significant human
| commit. You can check the codebase. It's messy, but not more
| messy than my codebases were pre-LLMs.
|
| I think prompts + tests in code will be the medium-term
| solution. Humans will be spending more time testing different
| architecture ideas and be involved in reviewing and larger
| changes that involve significant changes to the tests.
| maxemitchell wrote:
| Agreed with the medium-term solution. I wish I put some more
| detail into that part of the post, I have more thoughts on it
| but didn't want to stray too far off topic.
| 7speter wrote:
| I think the author is saying you commit the prompt with the
| resulting code. You said it yourself, storage is free, so
| include the prompt as a comment alongside the output (the
| prompt in addition to the code, not instead of it, if I'm not
| being clear); it would show the developer's intent and, to
| some degree, almost always contribute to the documentation
| process.
| maxemitchell wrote:
| Author here :). Right now, I think the pragmatic thing to do
| is to include all prompts used in either the PR description
| and/or in the commit description. This wouldn't make my
| longshot idea of "regenerating a repo from the ground up"
| possible, but it still adds very helpful context to code
| reviewers and can help others on your team learn prompting
| techniques.
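|
| For example, a commit message might carry the prompt like this
| (just one possible convention, invented for illustration, not
| something the post prescribes):
|
|     Add rate limiting to the token endpoint
|
|     Prompt: "Add per-client rate limiting to the /token endpoint.
|     Return 429 with a Retry-After header when the limit is hit.
|     Keep the existing storage interface unchanged."
|
|     Tool/model: <agent and model that produced the change>
|     Human edits: renamed helpers, tightened error messages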
| kace91 wrote:
| Plus, commits depend on the current state of the system.
|
| What sense does "getting rid of vulnerabilities by phasing out
| {dependency}" make, if the next generation of the code might
| not rely on the mentioned library at all? What does "improve
| performance of {method}" mean if the next generation used a
| fully different implementation?
|
| It makes no sense whatsoever except for a vibecoder's script
| that's being extrapolated into a codebase.
| croes wrote:
| You couldn't even tell in advance if the prompt produces code
| at all.
| js2 wrote:
| Discussion from 4 days ago when the code was announced (846
| points, 519 comments):
|
| https://news.ycombinator.com/item?id=44159166
| viraptor wrote:
| The documentation angle is really good. I've noticed it with the
| mdc files and llm.txt semi-standard. Documentation is often
| treated as just extra cost and a chore. Now, a good description
| of the project structure and good examples suddenly become
| something devs want ahead of time. Even if the reason is not
| perfect, I appreciate this shift we'll all benefit from.
| IncreasePosts wrote:
| I asked this in the other thread (no response, but I was a bit
| late)
|
| How does anyone using AI like this have confidence that they
| aren't unintentionally plagiarizing code and violating the terms
| of whatever license it was released under?
|
| For random personal projects I don't see it mattering that much.
| But if a large corp is releasing code like this, one would hope
| they've done some due diligence that they haven't just stolen
| the code from some similar repo on GitHub, laundered through an
| LLM.
|
| The only relevant section in the readme doesn't mention checking
| similar projects or libraries for common code:
|
| > Every line was thoroughly reviewed and cross-referenced with
| relevant RFCs, by security experts with previous experience with
| those RFCs.
| saghm wrote:
| Safety in the shadow of giant tech companies. People were upset
| when Microsoft released Copilot trained on GitHub data, but
| nobody who cared could do anything about it, and nobody who
| could have done something about it cared, so it just became the
| new norm.
| throwawaysleep wrote:
| As an individual dev, I simply don't care. Not my problem.
|
| Companies are satisfied with the indemnity provided by
| Microsoft.
| akdev1l wrote:
| > How does anyone using AI like this have confidence that they
| aren't unintentionally plagiarizing code and violating the
| terms of whatever license it was released under?
|
| They don't and no one cares
| ryandrake wrote:
| This is an excellent question that the AI-boosters always seem
| to dance around. Three replies already are saying "Nobody
| cares." Until they do. I'd be willing to bet that some time in
| the near future, some big company is going to care _a lot_ and
| that there will be a landmark lawsuit that significantly
| changes the LLM landscape. Regulation or a judge is going to
| eventually decide the extent to which someone can use AI to
| copy someone else's IP, and it's not going to be pretty.
| SpicyLemonZest wrote:
| It just presumes a level of fixation in copyright law that I
| don't think is realistic. There was a landmark lawsuit MAI v.
| Peak Computer in 1993, where judges determined that repairing
| a computer without the permission of the operating system's
| author is copyright infringement, and it didn't change the
| landscape at all because everyone immediately realized it's
| not practical for things to work that way. There's no
| realistic world where AI tools end up being extremely useful
| but nobody uses them because of a court ruling.
| tptacek wrote:
| Most of the code generated by LLMs, and _especially_ the code
| you actually keep from an agent, is mid, replacement-level,
| boring stuff. If you're not already building projects with
| LLMs, I think you need to start doing that first before you
| develop a strong take on this. From what I see in my own work,
| the code being generated is highly unlikely to be
| distinguishable. There is more of me and my prompts and
| decisions in the LLM code than there can possibly be defensible
| IPR from anybody else, unless the very notion of, like,
| wrapping a SQLite INSERT statement in Golang is defensible.
|
| The best way I can explain the experience of working with an
| LLM agent right now is that it is like if every API in the
| world had a magic "examples" generator that always included
| whatever it was you were trying to do (so long as what you were
| trying to do was within the obvious remit of the library).
| aryehof wrote:
| The consensus, right or wrong, is that LLM-produced code
| (unless repeated verbatim) is equivalent to you or me
| legitimately stating our novel understanding of mixed sources,
| some of which may be copyrighted.
| simonw wrote:
| All of the big LLM vendors have a "copyright shield" indemnity
| clause for their paying customers - a guarantee that if you get
| sued over IP for output from their models their legal team will
| step in to fight on your behalf.
| kentonv wrote:
| I'm fairly confident that it's not just plagiarizing because I
| asked the LLM to implement a novel interface with unusual
| semantics. I then prompted for many specific fine-grain changes
| to implement features the way I wanted. It seems entirely
| implausible to me that there could exist prior art that
| happened to be structured exactly the way I requested.
|
| Note that I came into this project believing that LLMs were
| plagiarism engines -- I was looking for that! I ended up
| concluding that this view was not consistent with the output I
| was actually seeing.
| cavisne wrote:
| Some APIs (Gemini at least) run a search on their outputs to
| see if the model is reciting data from training.
|
| So for direct copies like what you are talking about that would
| be picked up.
|
| For copying concepts from other libraries, that seems like a
| problem with or without LLMs.
| drodgers wrote:
| > Prompts as Source Code
|
| Another way to phrase this is LLM-as-compiler and Python (or
| whatever) as an intermediate compiler artefact.
|
| Finally, a true 6th generation programming language!
|
| I've considered building a toy of this with really aggressive
| modularisation of the output code (eg. python) and a query-based
| caching system so that each module of code output only changes
| when the relevant part of the prompt or upsteam modules change
| (the generated code would be committed to source control like a
| lockfile).
|
| I think that (+ some sort of WASM-encapsulated execution
| environment) would be one of the best ways to write one-off
| things like scripts, which _don't_ need to incrementally get
| better and more robust over time in the way that ordinary code
| does.
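|
| (A toy sketch of the caching half, with a hypothetical lockfile
| name; the model call itself is elided:)
|
|     import hashlib
|     import json
|     import pathlib
|
|     LOCK = pathlib.Path("prompts.lock.json")
|
|     def cache_key(prompt: str, upstream_sources: list[str]) -> str:
|         # The key covers the prompt section and the generated
|         # upstream modules it depends on.
|         h = hashlib.sha256(prompt.encode())
|         for src in upstream_sources:
|             h.update(pathlib.Path(src).read_bytes())
|         return h.hexdigest()
|
|     def module_up_to_date(name: str, prompt: str,
|                           upstream: list[str]) -> bool:
|         lock = json.loads(LOCK.read_text()) if LOCK.exists() else {}
|         return lock.get(name) == cache_key(prompt, upstream)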
| sumedh wrote:
| > Finally, a true 6th generation programming language!
|
| Karpathy already said English is the new programming language.
| declan_roberts wrote:
| These posts are funny to me because prompt engineers point at
| them as evidence of the fast-approaching software engineer
| obsolescence, but the amount of software engineering experience
| necessary to even guide an AI in this way is very high.
|
| The reason he keeps adjusting the prompts is because he knows how
| to program. He knows what it should look like.
|
| It just blurs the line between engineer and tool.
| tptacek wrote:
| I don't know why that's funny. This is not a post about a vibe
| coding session. It's Kenton Varda['s coding session].
|
| _later_
|
| _updated to clarify kentonv didn't write this article_
| kevingadd wrote:
| I think it makes sense that GP is skeptical of this article
| considering it contains things like:
|
| > this tool is improving itself, learning from every
| interaction
|
| which seem to indicate a fundamental misunderstanding of how
| modern LLMs work: the 'improving' happens by humans
| training/refining existing models offline to create new
| models, and the 'learning' is just filling the context window
| with more stuff, not enhancement of the actual model or the
| model 'learning' - it will forget everything if you drop the
| context and as the context grows it can 'forget' things it
| previously 'learned'.
| BurritoKing wrote:
| When you consider the "tool" as more than just the LLM
| model, but the stuff wrapped around calling that model then
| I feel like you can make a good argument it's improving
| when it keeps context in a file on disk and constantly
| updates and edits that file as you work throguh the
| project.
|
| I do this routinely for large initiatives I'm kicking off
| through Claude Code - it writes a long detailed plan into a
| file and as we work through the project I have it
| constantly updating and rewriting that document to add
| information we have jointly discovered from each bit of the
| work. That means every time I come back and fire it back
| up, it's got more information than when it started, which
| looks a lot like improvement from my perspective.
| tptacek wrote:
| I would love to hear more about this workflow.
| kiitos wrote:
| The sequence of commits talked about by the OP -- i.e.
| kenton's coding session's commits -- are like one degree
| removed from 100% pure vibe coding.
| tptacek wrote:
| Your claim here being that Kenton Varda isn't reading the
| code he's generating. Got it. Good note.
| kiitos wrote:
| No, that's not at all my claim, as it's obvious from the
| commit history that Kenton is reading the code he's
| generating before committing it.
| kentonv wrote:
| What do you mean by "one degree removed from 100% pure
| vibe coding", then? The definition of vibe coding is
| letting the AI code without review...
| kiitos wrote:
| > one degree removed
|
| You're letting Claude do your programming for you, and
| then sweeping up whatever it does afterwards. Bluntly,
| you're off-loading your cognition to the machine. If
| that's fine by you then that's fine enough, it just means
| that the quality of your work becomes a function of your
| tooling rather than your capabilities.
| kentonv wrote:
| I don't agree. The AI largely does the boring and obvious
| parts. I'm still deciding what gets built and how it is
| designed, which is the interesting part.
| tptacek wrote:
| It's the same with me, with the added wrinkle of pulling
| each PR branch down and refactoring things (and,
| ironically, introducing my own bugs).
| kiitos wrote:
| > I'm still deciding what gets built and how it is
| designed, which is the interesting part.
|
| How, exactly? Do you think that you're "deciding what
| gets built and how it's designed" by iterating on the
| prompts that you feed to the LLM that generates the code?
|
| Or are you saying that you're somehow able to write the
| "interesting" code, and can instruct the LLM to generate
| the "boring and obvious" code that needs to be filled-in
| to make your interesting code work? (This is certainly
| not what's indicated by your commit history, but, who
| knows?)
| spaceman_2020 wrote:
| The argument is that this stuff will so radically improve
| senior engineer productivity that the demand for junior
| engineers will crater. And without a pipeline of junior
| engineers, the junior-to-senior trajectory will radically
| atrophy
|
| Essentially, the field will get frozen where existing senior
| engineers will be able to utilize AI to outship traditional
| senior-junior teams, even as junior engineers fail to secure
| employment
|
| I don't think anything in this article counters this argument
| tptacek wrote:
| I don't know why people don't give more credence to the
| argument that the exact opposite thing will happen.
| dcre wrote:
| Right. I don't understand why everyone thinks this will
| make it impossible for junior devs to learn. The people I
| had around to answer my questions when I was learning knew
| a whole lot less than Claude and also had full time jobs
| doing something other than answering my questions.
| fch42 wrote:
| It won't make it impossible for junior engineers to
| learn.
|
| It will simply reduce the number of opportunities to
| learn (and not just for juniors), by virtue of companies'
| beancounters concluding "two for one" (several juniors)
| doesn't return the same as "buy one get one free"
| (existing staff + AI license).
|
| I dread the day we all "learn from AI". The social
| interaction part of learning is just as important as the
| content of it, really, especially when you're young; none
| of that comes across yet in the pure "1:1 interaction"
| with AI.
| auggierose wrote:
| I learnt programming on my own, without any social
| interaction involved. In fact, I loved programming
| because it does not involve any social interaction.
|
| Programming has become more of a "social game" in the
| last 15 years or so. AI is a new superpower for people
| like me, bringing balance to the Force.
| delegate wrote:
| You learn by doing, e.g. typing the code. It's not just
| knowledge, it's the intuition you develop when you write
| code yourself. Just like physical exercise. Or playing an
| instrument. It's not enough to know the theory, practice
| is key.
|
| AI makes it very easy to avoid typing and hence make
| learning this skill less attractive.
|
| But I don't necessarily see it as doom and gloom, what I
| think will happen - juniors will develop advanced
| intuition about using AI and getting the functionality
| they need, not the quality of the code, while at the same
| time the AI models will get increasingly better and write
| higher quality code.
| Ataraxic wrote:
| Junior devs using AI can get a lot better at using AI and
| learn the existing patterns it generates, but I notice,
| for myself, that if I let AI write a lot of the code, I
| remember it less well and thereby understand it less later on.
| This applies in school and when trying to learn new
| things but the act of writing down the solution and
| working out the details yourself trains our own brain.
| I'd say that this has been a practice for over a thousand
| years and I'm skeptical that this will make junior devs
| grow their own skills faster.
|
| I think asking questions to the AI for your own
| understanding totally makes sense, but there is a benefit
| when you actually create the code versus asking the AI to
| do it.
| tptacek wrote:
| I'm sure there is when you're just getting your sea legs
| in some environment, but at some point most of the code
| you write in a given environment is rote. Rote code is
| both depleting and mutagenic --- if you're fluent and
| also interested in programming, you'll start convincing
| yourself to do stupid stuff to make the code less rote
| ("DRY it up", "make a DSL", &c) that makes your code less
| readable and maintainable. It's a trap I fall into
| constantly.
| kiitos wrote:
| > but at some point most of the code you write in a given
| environment is rote
|
| "Most of the code one writes in a given environment is
| rote" is true in the same sense that most of the words
| one writes in any given bit of text are rote e.g.
| conjunctions, articles, prepositions, etc.
| tptacek wrote:
| Some writers I know are convinced this is true, but I
| still don't think the comparison is completely apt,
| because _deliberately_ rote code with _modulated_
| expressiveness is often (even usually) a virtue in
| coding, and not always so with writing. For experienced
| or enthusiastic coders, that is to say, the effort is
| often in not doing stupid stuff to make the code more
| clever.
|
| Straight-line replacement-grade mid code that just does
| the things a prompt tells it to in the least clever most
| straightforward way possible is usually a good thing;
| that long clunky string of modifiers goes by the name
| "maintainability".
| spaceman_2020 wrote:
| If a junior engineer ships a similar repo to this with the
| help of AI, sure, I'll buy that.
|
| But as of now, it's senior engineers who really know what
| they're doing who can spot the errors in AI code.
| tptacek wrote:
| Hold on. You said "really know what they're doing". Yes,
| I agree with that. What I don't buy is the coupling of
| that concept with "seniority".
| danielbln wrote:
| Have a better term for "knows what they're doing" other
| than senior?
| tptacek wrote:
| That's not what "senior" means.
| danielbln wrote:
| Maybe you could enlighten the rest of us then. According
| to your favorite definition, what does senior mean, what
| does seniority mean, and what's a term for someone who
| knows what they're doing?
| tptacek wrote:
| Seniority means you've held the role for a long time.
| etothet wrote:
| This is not necessarily true in practical terms when it
| comes to hiring or promoting. Often a senior dev becomes a
| senior because of an advanced skillset, regardless of years
| on the job. Similarly, developers who have been on the job
| for many years often aren't ready for senior because they
| lack soft and hard skills.
| tptacek wrote:
| Oh, that's _one_ of the ways a senior dev becomes senior.
| latexr wrote:
| > It just blurs the line between engineer and tool.
|
| I realise you meant it as "the engineer and their tool blend
| together", but I read it like a funny insult: "that guy likes
| to think of himself as an engineer, but he's a complete tool".
| visarga wrote:
| > prompt engineers point at them as evidence of the fast-
| approaching software engineer obsolescence
|
| Maybe journalists and bloggers angling for attention do it;
| prompt engineers are too aware of the limitations of prompting
| to do that.
| thegrim33 wrote:
| I mean yeah, the very first prompt given to the AI was put
| together by an experienced developer; a bunch of code telling
| the AI exactly what the API should look like and how it would
| be used. The very first step in the process already required an
| experienced developer to be involved.
| thorum wrote:
| Humorous that this article has a strong AI writing smell - the
| author should publish the prompts they used!
| dcre wrote:
| I don't like to accuse, and the article is fine overall, but
| this stinks: "This transparency transforms git history from a
| record of changes into a record of intent, creating a new form
| of documentation that bridges human reasoning and machine
| implementation."
| keybored wrote:
| > I don't like to accuse, and the article is fine overall,
| but this stinks:
|
| Now consider your reasonable instinct not to accuse other
| people, coupled with the possibility of setting AI loose
| with "write a positive article about AI where you have some
| paragraphs about the current limitations based on this
| link. write like you are just following the evidence."
| Meanwhile we are supposed to sit here and weigh every word.
|
| This reminds me to write a prompt for a blog post: how AI
| could be used for making those personal-looking
| tech-guy-who-meditates-and-runs websites. (Do we have the
| technology? Yes we do.)
| ZephyrBlu wrote:
| Also: " _This OAuth library represents something larger than
| a technical milestone--it 's evidence of a new creative
| dynamic emerging_"
|
| Em-dash baby.
| latexr wrote:
| Can we please stop using the em-dash as a metric to
| "detect" LLM writing? It's lazy and wrong. Plenty of people
| use em-dashes, _it's a useful punctuation mark_. If humans
| didn't use them, they wouldn't be in the LLM training data.
|
| There are better clues, like the kind of vague pretentious
| babble bad marketers use to make their products and ideas
| seem more profound than they are. It's a type of bad
| writing which looks grandiose but is ultimately meaningless
| and that LLMs heavily pick up on.
| grey-area wrote:
| _Very_ few people use n dashes in internet writing as
| opposed to dashes as they are not available on the
| default keyboard.
| latexr wrote:
| That's not true at all. Apple's OSes have smart punctuation
| enabled by default and convert -- (two hyphens) into —
| (an "em-dash"; _not_ an "en-dash", which has a different
| purpose), "straight" (dumb) quotes into “curly” (smart)
| quotes, and so forth.
|
| Furthermore, on macOS there are simple key combinations
| (e.g. with ⌥) to make all sorts of smart punctuation
| even if you don't have the feature enabled by default,
| and on iOS you can long press on a key (such as the
| hyphen) to see alternates.
|
| The majority of people may not use correct punctuation
| marks, but enough do that assuming a single character
| immediately means they used an LLM is just plain wrong. I
| have _never_ used an LLM to write a blog post, internet
| comment, or anything of the sort, and I have used smart
| punctuation in all my writing for over a decade. Same
| with plenty of other HN commenters, journalists, writers,
| editors, and on and on. You don't need to be a literal
| machine to care about correct character use.
| grey-area wrote:
| So we've established the default is a hyphen, not an em
| dash.
|
| You can certainly select an em dash but most don't know
| what it means and don't use it.
|
| It's certainly not infallible proof but multiple uses of
| it in comments online (vs published material or
| newspapers) are very unusual, so I think it's an
| interesting indicator. I completely agree it is common in
| some texts, usually ones from publishing houses with
| style guides but also people who know about writing or
| typography.
| thoroughburro wrote:
| On the "default keyboard" of most people (a phone), you
| just long-press hyphen to choose any dash length.
| grey-area wrote:
| But who does? Not many.
| purplesyringa wrote:
| This is a post _with formatting_ and we're programmers
| here. I can assure you their editor (or Markdown)
| supports em-dash in some fashion.
| ZeroTalent wrote:
| I have used Em-dashes in many of my comments for years.
| It's just a result of reading books, where Em-dashes happen
| a lot.
| mpalmer wrote:
| The sentence itself is a smeLLM. Grandiose pronouncements
| aren't a bot exclusive, but man do they love making them,
| especially about creative paradigms and dynamics
| OjotCewIo wrote:
| > this stinks: "This transparency transforms git history from
| a record of changes into a record of intent, creating a new
| form of documentation that bridges human reasoning and
| machine implementation."
|
| That's where I stopped reading. If they needed "AI" for
| turning their git history into a record of intent
| ("transparency"), then they had been doing it all wrong,
| previously. Git commit messages have _always_ been a "form
| of documentation that bridges human reasoning" -- namely,
| with another human's (the reader's) reasoning.
|
| If you don't walk your reviewer through your patch, in your
| commit message, as if you were _teaching_ them, then you're
| doing it wrong.
|
| Left a bad taste in my mouth.
| maxemitchell wrote:
| I did human notes -> had Claude condense and edit -> manually
| edit. A few of the sentences (like the stinky one below) were
| from Claude which I kept if it matched my own thoughts, though
| most were changed for style/prose.
|
| I'm still experimenting with it. I find it can't match style at
| all, and even with the manual editing it still "smells like AI"
| as you picked up. But, it also saves time.
|
| My prompt was essentially "here are my old blog posts, here's
| my notes on reading a bunch of AI generated commits, help me
| condense this into a coherent article about the insights I
| learned"
| fpgaminer wrote:
| I used almost 100% AI to build a SCUMM-like parser, interpreter,
| and engine (https://github.com/fpgaminer/scumm-rust). It was a
| fun workflow; I could generally focus on my usual work and just
| pop in occasionally to check on and direct the AI.
|
| I used a combination of OpenAI's online Codex, and Claude Sonnet
| 4 in VSCode agent mode. It was nice that Codex was more automated
| and had an environment it could work in, but its thought-logs are
| terrible. Iteration was also slow because it takes a while for it
| to spin the environment up. And while you _can_ have multiple
| requests running at once, it usually doesn't make sense for a
| single, somewhat small project.
|
| Sonnet 4's thoughts were much more coherent, and it was fun to
| watch it work and figure out problems. But there's something
| broken in VSCode right now that makes its ability to read console
| output inconsistent, which made things difficult.
|
| The biggest issue I ran into is that both are set up to seek out
| and read only small parts of the code. While they're generally
| good at getting enough context, it does cause some degradation in
| quality. A frequent issue was replication of CSS styling between
| the Rust side of things (which creates all of the HTML elements)
| and the style.css side of things. Like it would be working on the
| Rust code and forget to check style.css, so it would just
| manually insert styles on the Rust side even though those
| elements were already styled on the style.css side.
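|
| (A minimal sketch of that failure mode, with invented names
| rather than the actual scumm-rust code: the class already
| exists in style.css, but the model restyles the element
| inline from the Rust side.)
|
|     use wasm_bindgen::JsValue;
|     use web_sys::{window, Element};
|
|     // style.css already defines:
|     //   .verb-button { color: #9cf; }
|     fn make_verb_button() -> Result<Element, JsValue> {
|         let doc = window().unwrap().document().unwrap();
|         let button = doc.create_element("div")?;
|         // What the model tends to do after reading only the
|         // Rust side: re-create the styling inline.
|         button.set_attribute("style", "color: #9cf;")?;
|         // What it should do: reuse the existing class.
|         // button.set_class_name("verb-button");
|         Ok(button)
|     }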
|
| Codex is also _terrible_ at formatting and will frequently muck
| things up, so it's mandatory to pair it with an autoformatter
| and instruct it to run that tool. Even with that, Codex will
| often say that it ran the formatter but didn't actually run it
| (or ran it somewhere in the middle instead of at the end), so
| its pull requests fail CI.
| Sonnet never seemed to have this issue and just used the
| prevailing style it saw in the files.
|
| Now, when I say "almost 100% AI", it's maybe 99% because I did
| have to step in and do some edits myself for things that both
| failed at. In particular neither can see the actual game running,
| so they'd make weird mistakes with the design. (Yes, Sonnet in VS
| Code can see attached images, and potentially can see the DOM of
| vscode's built in browser, but the vision of all SOTA models is
| ass so it's effectively useless). I also stepped in once to do
| one major refactor. The AIs had decided on a very strange, messy,
| and buggy interpreter implementation at first.
| eviks wrote:
| > Imagine version control systems where you commit the prompts
| used to generate features rather than the resulting
| implementation.
|
| So every single run will result in a different,
| non-reproducible implementation with unique bugs requiring
| manual expert intervention. How is this better?
| cosmok wrote:
| I have documented my experience using an agent for a slightly
| different task -- upgrading a framework version. I had to
| abandon the work, but my learnings have been similar to what
| is in the post.
|
| https://www.trk7.com/blog/ai-agents-for-coding-promise-vs-re...
| never_inline wrote:
| > Don't be afraid to get your hands dirty. Some bugs and styling
| issues are faster to fix manually than to prompt through. Knowing
| when to intervene is part of the craft.
|
| This has been my experience as well. It helps to always run
| the CLI tool in the bottom pane of an IDE and not in a
| standalone terminal.
| brador wrote:
| Many of you are failing to comprehend the potential scale of
| AI-generated codebases.
|
| Take note - there is no limit. Every feature you or the AI can
| prompt can be generated.
|
| Imagine if you were immortal and given unlimited storage. Imagine
| what you could create.
|
| That's a prompt away.
|
| Even now you're still restricting your thinking to the old ways.
| latexr wrote:
| You're sounding like a religious zealot recruiting for a cult.
|
| No, it is not possible to prompt every feature, and I suspect
| people who believe LLMs can accurately program anything in any
| language are frankly not solving any truly novel or interesting
| problems, because if they were they'd see the obvious cracks.
| nojito wrote:
| > I suspect people who believe LLMs can accurately program
| anything in any language are frankly not solving any truly
| novel or interesting problems, because if they were they'd
| see the obvious cracks.
|
| The vast majority of problems in programming aren't novel or
| interesting.
| politelemon wrote:
| > That's a prompt away.
|
| Currently, it's 6 prompts away in which 5 of those are me
| guiding the LLM to output the answer that I already have in
| mind.
| _lex wrote:
| You're talking ahead of the others in this thread, who do not
| understand how you got to what you're saying. I've been doing
| research in this area. You are not only correct, but the
| implications are staggering, and go further than what you have
| mentioned above. This is no cult, it is the reorganization of
| the economics of work.
| OjotCewIo wrote:
| > it is the reorganization of the economics of work
|
| and the overwhelming majority of humanity will be worse off
| for it
| UltraSane wrote:
| I was thinking that if you had a good enough verified
| mathematical model of your code using TLA+ or similar, you
| could then use an LLM to generate your code in any language
| and be confident it is correct. This would be Declarative
| Programming: instead of putting in a lot of work writing code
| that MIGHT do what you intend, you put more work into creating
| the verified model, and then the LLM generates code that will
| do what the model intends.
| kookamamie wrote:
| > Treat prompts as version-controlled assets
|
| This only works if the model and its context are immutable. None
| of us really control the models we use, so I'd be sceptical about
| reproducing the artifacts later.
| lmeyerov wrote:
| If/when to commit prompts has been a fascinating question, as
| we have been doing something similar to build Louie.ai. I now
| have several categories with different handling:
|
| - Human reviewed: Code guidelines and prompt templates are
| essentially dev tool infra-as-code and need review
|
| - Discarded: Individual prompt commands I write, and the
| implementation plan/progress files the AI writes, both get
| trashed, and are even part of my .gitignore (sketch below).
| They were kept by Cloudflare, but we don't keep these.
|
| - Unreviewed: Claude Code does not do RAG in the usual sense, so
| it is on us to create guides for how we do things like use big
| frameworks. They are basically indexes for speeding up AI with
| less grepping + hallucinating across memory compactions. The AI
| reads and writes these, and we largely stay out of it.
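|
| (Roughly what that discarded bucket looks like in the
| .gitignore -- these entries are illustrative, not our exact
| paths:)
|
|     # AI scratch: one-off prompts and plan/progress notes
|     ai-scratch/
|     *.plan.md
|     *.claude-progress.md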
|
| There are weird cases I am still trying to figure out. Ex:
|
| - a feature impl might start with the AI coming up with the
| product spec, so having that spec maintained as the AI
| progresses, and committed, is a potentially useful artifact
|
| - knowing how prompt templates get used is helpful for
| maintaining them automatically.
| Fischgericht wrote:
| So, it means that you and the LLM together have managed to write
| SEVEN lines of trivial code per hour. On a protocol that is
| perfectly documented, where you can look at about one million
| other implementations when in doubt.
|
| It is not my intention to hurt your feelings, but it sounds like
| you and/or the LLM are not really good at their job. Looking at
| programmer salaries and LLM energy costs, this appears to be a
| very very VERY expensive OAuth library.
|
| Again: Not my intention to hurt any feelings, but the numbers
| really are shockingly bad.
| Fischgericht wrote:
| Yes, my brain got confused on who wrote the code and who just
| reported about it. I am truly sorry. I will go see my LLM
| doctor to get my brain repaired.
| kentonv wrote:
| I spent about 5 days semi-focused on this codebase (though I
| always have lots of people interrupting me all the time). It's
| about 5000 lines (if you count comments, tests, and
| documentation, which you should). Where do you get 7 lines per
| hour?
| nojito wrote:
| >So, it means that you and the LLM together have managed to
| write SEVEN lines of trivial code per hour.
|
| Here's their response
|
| >It took me a few days to build the library with AI.
|
| >I estimate it would have taken a few weeks, maybe months to
| write by hand.
|
| >That said, this is a pretty ideal use case: implementing a
| well-known standard on a well-known platform with a clear API
| spec.
|
| https://news.ycombinator.com/item?id=44160208
|
| Lines of code per hour is a terrible metric to use.
| Additionally, it's far easier to critique code that's already
| written!
| moron4hire wrote:
| I'm sorry, this all sounds like a fucking miserable experience.
| Like, if this is what my job becomes, I'll probably quit tech
| completely.
| kentonv wrote:
| That's exactly what I thought, too, before I tried it!
|
| Turns out it feels very different than I expected. I really
| recommend trying it rather than assuming. There's no learning
| curve, you just install Claude Code and run it in your repo and
| ask it for things.
|
| (I am the author of the code being discussed. Or, uh, the
| author of the prompts at least.)
| Arainach wrote:
| >Around the 40-commit mark, manual commits became frequent
|
| This matches my experience: some shiny (even sometimes
| impressive) greenfield demos, but dramatically less useful for
| maintaining a codebase - which for any successful product is
| 90% of the work.
| Lerc wrote:
| > _Treat prompts as version-controlled assets. Including prompts
| in commit messages creates valuable context for future
| maintenance and debugging._
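|
| (For concreteness, the quoted practice looks roughly like
| this -- the message and prompt below are invented for
| illustration, not taken from the Cloudflare repo:)
|
|     Add token endpoint error handling
|
|     Prompt: "Return RFC 6749-compliant error responses
|     (invalid_grant, invalid_client) from the /token handler
|     instead of throwing, and add tests for each case."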
|
| I think this is valuable data, but it is also out of distribution
| data. Prior to AI models writing code, this wasn't present in
| the training set. Additional training will probably be needed
| to correlate better results with the new input stream, and
| also to learn that some of the records document its own
| unreliability and to develop a healthy scepticism of what it
| has said in the past.
|
| There's a lot of talk about model collapse with models training
| purely on their own output, or AI slop infecting training data
| sets, but ultimately it is all data. Combined with a signal to
| say which bits were ultimately beneficial, it can all be put to
| use. Even the failures can provide a good counterfactual signal
| for contrastive learning.
| _pdp_ wrote:
| I commented on the original discussion a few days ago but I will
| do it again.
|
| Why is this such a big deal? This library is not even that
| interesting. It is a very straightforward task I expect most
| programmers will be able to pull off easily. 2/3 of the code
| is type interfaces and comments. The rest is a by-the-book
| implementation of a protocol that is not even that complex.
|
| Please, there are some React JSX files in your code base with a
| lot more complexities and intricacies than this.
|
| Has anyone even read the code at all?
| axi0m wrote:
| >> what if we treated prompts as the actual source code?
|
| And they probably will be. It looks like prompts have become
| the new higher-level coding language, the same way JavaScript
| is a human-friendly abstraction over lower-level languages
| (like C), which are in turn a more accessible way to write
| assembly, and the same goes for the underlying binary code...
| I guess we have now reached the final step in the development
| chain, bridging the gap between hardware instructions and
| human language.
| starkparker wrote:
| > Almost every feature required multiple iterations and
| refinements. This isn't a limitation--it's how the collaboration
| works.
|
| I guess that's where a big miss in understanding so much of the
| messaging about generative AI in coding happens for me, and why
| the Fly.io skepticism blog post irritated me so much as well.
|
| It _is_ how collaboration with a person works, but when you
| have to fix the issues that the tool created, you aren't
| collaborating with a person; you're making up for a broken
| tool.
|
| I can't think of any field where I'd be expected to not only put
| up with, but also celebrate, a tool that screwed up and required
| manual intervention so often.
|
| The level of anthropomorphism that occurs in order to advocate on
| behalf of generative AI use leads to saying things like "it's how
| collaboration works" here, when I'd never say the same thing
| about the table saw in my woodshop, or even the relatively smart
| cruise control on my car.
|
| Generative AI is still just a tool built by people following a
| design, and which purportedly makes work easier. But when my saw
| tears out cuts that I have to then sand or recut, or when my car
| slams on the brakes because it can't understand a bend in the
| road around a parking lane, I don't shrug and ascribe them human
| traits and blame myself for being frustrated over how they
| collaborate with me.
___________________________________________________________________
(page generated 2025-06-07 23:01 UTC)