[HN Gopher] I read all of Cloudflare's Claude-generated commits
___________________________________________________________________
I read all of Cloudflare's Claude-generated commits
Author : maxemitchell
Score : 201 points
Date : 2025-06-06 22:35 UTC (1 days ago)
(HTM) web link (www.maxemitchell.com)
(TXT) w3m dump (www.maxemitchell.com)
| SupremumLimit wrote:
| It's an interesting review but I really dislike this type of
| techno-utopian determinism: "When models inevitably improve..."
| Says who? How is it inevitable? What if they've actually reached
| their limits by now?
| Dylan16807 wrote:
| Models are improving every day. People are figuring out
| thousands of different optimizations to training and to
| hardware efficiency. The idea that right now in early June 2025
| is when improvement _stops_ beggars belief. We might be
| approaching a limit, but that's going to be a sigmoid curve,
| not a sudden halt in advancement.
| deadbabe wrote:
| 5 years ago a person would be blown away by today's LLMs. But
| people today will merely say "cool" at whatever LLMs are in
| use 5 years from now. Or maybe not even that.
| dingnuts wrote:
| 5 years ago GPT-2 was already outputting largely coherent
| speech; there's been progress, but it's not all that
| shocking.
| tptacek wrote:
| For most of the developers I know personally who have been
| radicalized by coding agents, it happened within the past 9
| months. It does not feel like we are in a phase of
| predictable, boring improvement.
| keybored wrote:
| Radicalized? Going with the flow and wishes of the people
| who are driving AI is the opposite of that.
|
| To have their minds changed drastically, sure..
| tptacek wrote:
| Sorry I have no idea what you're trying to say here.
| lcnPylGDnU4H9OF wrote:
| > very different from the usual or traditional
|
| https://www.merriam-webster.com/dictionary/radical
|
| Going from deciding that AI is going nowhere to suddenly
| deciding that coding agents are how they will work going
| forward is a radical change. That is what they meant.
| keybored wrote:
| Did you miss my second paragraph?
|
| https://www.merriam-webster.com/dictionary/paragraph
| Dylan16807 wrote:
| Can you explain exactly what you meant by your second
| paragraph? The ambiguity is why you got that reply.
|
| If your second paragraph makes that reply irrelevant, are
| you saying the meaning was "Your use of 'radicalized' is
| technically correct but I still think you shouldn't have
| used it here"?
| dwaltrip wrote:
| Bold prediction...
| sitkack wrote:
| It is copium that it will suddenly stop and the world they
| knew before will return.
|
| ChatGPT came out in Nov 2022. Attention Was All There Was in
| 2017, we were already 5 years in the past. Or 5 years of
| research to catch up to, and then from 2022 to now ... papers
| and research have been increasing exponentially. Even if SOTA
| models were frozen, we still have years of research to apply
| and optimize in various ways.
| BoorishBears wrote:
| I think it's equally copium that people keep assuming we're
| just going to compound our way into intelligence that
| generalizes enough to stop us from handholding the AI, as
| much as I'd _genuinely_ enjoy that future.
|
| Lately I spend all day post-training models for my product,
| and I want to say 99% of the research specific to LLMs
| doesn't reproduce and/or matter once you actually dig in.
|
| We're getting exponentially more papers on the topics and
| they're getting worse on average.
|
| Every day there's a new paper claiming an X% gain by post-
| training some ancient 8B parameter model and comparing it
| to a bunch of other ancient models after they've overfitted
| on the public dataset of a given benchmark and given the
| model a best of 5.
|
| And benchmarks won't ever show it, but even ChatGPT
| 3.5-Turbo has better general world knowledge than a lot of
| models people consider "frontier" models today, because
| post-training makes it easy to cover up those gaps with
| very impressive one-prompt outputs and strong benchmark
| scores.
|
| -
|
| It feels like things are getting stuck in a local maximum:
| we _are_ making forward progress, the models _are_ useful
| and getting more useful, but the future people are
| envisioning takes reaching a completely different goal post
| that I'm not at all convinced we're making exponential
| progress towards.
|
| There may be an exponential number of techniques claiming to
| be groundbreaking, but what has actually unlocked new
| capabilities that can't just as easily be attributed to how
| much more focused post-training has become on coding and
| math?
|
| Test time compute feels like the only one and we're already
| seeing the cracks form in terms of its effect on
| hallucinations, and there's a clear ceiling for the
| performance the current iteration unlocks as all these
| models are converging on pretty similar performance after
| just a few model releases.
| rxtexit wrote:
| The copium is, I think, that many people got comfortable
| post-financial-crisis with nothing much changing or happening.
| I think many people really liked a decade-long stretch with
| not much more than web framework updates and smartphone
| versioning.
|
| We are just back on track.
|
| I just read Oracular Programming: A Modular Foundation for
| Building LLM-Enabled Software the other day.
|
| We don't even have a new paradigm yet. I would be shocked
| if in 10 years I don't look back at this time of writing
| a prompt into a chatbot and then pasting the code into an
| IDE as completely comical.
|
| The most shocking thing to me is we are right back on track
| to what I would have expected in 2000 for 2025. In 2019
| those expectations seemed like science fiction delusions
| after nothing happening for so long.
| sitkack wrote:
| Reading the Oracular paper now,
| https://news.ycombinator.com/edit?id=44211588
|
| It feels a bit like Halide, where the goal and the
| strategy are separated so that each can be optimized
| independently.
|
| Those new paradigms are being discovered by hordes of
| vibecoders, myself included. I am having wonderful
| results with TDD and AI assisted design.
|
| IDEs are now mostly browsers for code, and I no longer
| copy and paste with a chatbot.
|
| Curious what you think about the Oracular paper. One area
| that I have been working on for the last couple weeks is
| extracting ToT for the domain and then using the LLM to
| generate an ensemble of exploration strategies over that
| tree.
| a2128 wrote:
| I think at this point we're reaching more incremental
| updates, which can score higher on some benchmarks but then
| simultaneously behave worse with real-world prompts, most
| especially if they were prompt engineered for a specific
| model. I recall Google updating their Flash model on their
| API with no way to revert to the old one and it caused a lot
| of people to complain that everything they've built is no
| longer working because the model is just behaving differently
| than when they wrote all the prompts.
| whbrown wrote:
| Isn't it quite possible they replaced that Flash model with
| a distilled version, saving money rather than increasing
| quality? This just speaks to the value of open-weights more
| than anything.
| Sevii wrote:
| Models have improved significantly over the last 3 months. Yet
| people have been saying 'What if they've actually reached their
| limits by now?' for pushing 3 years.
| greyadept wrote:
| For me, improvement means no hallucination, but that only
| seems to have gotten worse and I'm interested to find out
| whether it's actually solvable at all.
| dymk wrote:
| All the benchmarks would disagree with you
| thuuuomas wrote:
| Today's public benchmarks are yesterday's training data.
| BoorishBears wrote:
| The benchmarks also claim random 32B parameter models
| beat Claude 4 at coding, so we know just how much they
| matter.
|
| It should be obvious to anyone with a cursory interest in
| model training that you can't trust benchmarks unless
| they're fully private black-boxes.
|
| If you can get even a _hint_ of the shape of the
| questions on a benchmark, it's trivial to synthesize
| massive amounts of data that help you beat the benchmark.
| And given the nature of funding right now, you're almost
| silly _not_ to do it: it's not cheating, it's
| "demonstrably improving your performance at the
| downstream task"
| tptacek wrote:
| Why do you care about hallucination for coding problems?
| You're in an agent loop; the compiler is ground truth. If
| the LLM hallucinates, the agent just iterates. You don't
| even see it unless you make the mistake of looking closely.
| kiitos wrote:
| What on earth are you talking about??
|
| If the LLM hallucinates, then the code it produces is
| wrong. That wrong code isn't obviously or
| programmatically determinable as wrong, the agent has no
| way to figure out that it's wrong, it's not as if the LLM
| produces at the same time tests that identify that
| hallucinated code as being wrong. The only way that this
| wrong code can be identified as wrong is by the human
| user "looking closely" and figuring out that it is wrong.
|
| You seem to have this fundamental belief that the code
| that's produced by your LLM is valid and doesn't need to
| be evaluated, line-by-line, by a human, before it can be
| committed?? I have no idea how you came to this belief
| but it certainly doesn't match my experience.
| tptacek wrote:
| No, what's happening here is we're talking past each
| other.
|
| An agent _lints and compiles code_. The LLM is stochastic
| and unreliable. The agent is ~200 lines of Python code
| that checks the exit code of the compiler and relays it
| back to the LLM. You can easily fool an LLM. You can't fool
| the compiler.
|
| I didn't say anything about whether code needs to be
| reviewed line-by-line by humans. I review LLM code line-
| by-line. Lots of code that compiles clean is nonetheless
| horrible. But none of it includes hallucinated API calls.
|
| Also, where did this "you seem to have a fundamental belief"
| stuff come from? You had like 35 words to go on.
| someothherguyy wrote:
| Linting isn't verification of correctness, and yes, you
| can fool the compiler, linters, etc. Work with some human
| interns, they are great at it. Agents will do crazy
| things to get around linting errors, including removing
| functionality.
| fragmede wrote:
| have you no tests?
| kiitos wrote:
| Irrelevant, really. Tests establish a minimum threshold
| of acceptability, they don't (and can't) guarantee
| anything like overall correctness.
| tptacek wrote:
| Just checking off the list of things you've determined to
| be irrelevant. Compiler? Nope. Linter? Nope. Test suite?
| Nope. How about TLA+ specifications?
| skydhash wrote:
| TLA+ specs don't verify code. They verify design. Such
| design can be expressed in whatever, including pseudocode
| (think algorithms notation in textbooks). Then you write
| the TLA specs that will judge if invariants are truly
| respected. Once you're sure of the design, you can go and
| implement it, but there are no hard constraints like a type
| system.
| tptacek wrote:
| At what level of formal methods verification does the
| argument against AI-generated code fall apart? My
| expectation is that the answer is "never".
|
| The subtext is pretty obvious, I think: that standards,
| on message boards, are being set for LLM-generated code
| that are ludicrously higher than would be set for people-
| generated code.
| kiitos wrote:
| I truly don't know what you're trying to communicate,
| with all of your recent comments related to AI and LLM
| and codegen and etc., the only thing I can guess is that
| you're just cynically throwing sand into the wind. It's
| unfortunate, your username used to carry some clout and
| respect.
| kiitos wrote:
| > If the LLM hallucinates, then the code it produces is
| wrong. That wrong code isn't obviously or
| programmatically determinable as wrong, the agent has no
| way to figure out that it's wrong, it's not as if the LLM
| produces at the same time tests that identify that
| hallucinated code as being wrong. The only way that this
| wrong code can be identified as wrong is by the human
| user "looking closely" and figuring out that it is wrong
|
| The LLM can easily hallucinate code that will satisfy the
| agent and the compiler but will still fail the actual
| intent of the user.
|
| > I review LLM code line-by-line. Lots of code that
| compiles clean is nonetheless horrible.
|
| Indeed _most_ code that LLMs generate compiles clean and
| is nevertheless horrible! I'm happy that you recognize
| this truth, but the fact that you review that LLM-
| generated code line-by-line makes you an extraordinary
| exception vs. the normal user, who generates LLM code and
| absolutely does not review it line-by-line.
|
| > But none of [the LLM generated code] includes
| hallucinated API calls.
|
| Hallucinated API calls are just one of many many possible
| kinds of hallucinated code that an LLM can generate, by
| no means does "hallucinated code" describe only
| "hallucinated API calls" -- !
| tptacek wrote:
| When you say "the LLM can easily hallucinate code that
| will satisfy the compiler but still fail the actual
| intent of the user", all you are saying is that the code
| will have bugs. My code has bugs. So does yours. You
| don't get to use the fancy word "hallucination" for
| reasonable-looking, readable code that compiles and lints
| but has bugs.
|
| I think at this point our respective points have been
| made, and we can wrap it up here.
| someothherguyy wrote:
| Hallucination is a fancy word?
|
| The parent seems to be, in part, referring to "reward
| hacking", which tends to be used as a super category to
| what many refer to as slop, hallucination, cheating, and
| so on.
|
| https://courses.physics.illinois.edu/ece448/sp2025/slides
| /le...
| kiitos wrote:
| > When you say "the LLM can easily hallucinate code that
| will satisfy the compiler but still fail the actual
| intent of the user", all you are saying is that the code
| will have bugs. My code has bugs. So does yours. You
| don't get to use the fancy word "hallucination" for
| reasonable-looking, readable code that compiles and lints
| but has bugs.
|
| There is an obvious and categorical difference between
| the "bugs" that an LLM produces as part of its generated
| code, and the "bugs" that I produce as part of the code
| that I write. You don't get to conflate these two classes
| of bugs as though they are equivalent, or even
| comparable. They aren't.
| tptacek wrote:
| They obviously are.
| simonw wrote:
| You seem to be using "hallucinate" to mean "makes
| mistakes".
|
| That's not how I use it. I see hallucination as a very
| specific kind of mistake: one where the LLM outputs
| something that is entirely fabricated, like a class
| method that doesn't exist.
|
| The agent compiler/linter loop can entirely eradicate
| those. That doesn't mean the LLM won't make plenty of
| other mistakes that don't qualify as hallucinations by
| the definition I use!
|
| It's newts and salamanders. Every newt is a salamander,
| not every salamander is a newt. Every hallucination is a
| mistake, not every mistake is a hallucination.
|
| https://simonwillison.net/2025/Mar/2/hallucinations-in-
| code/
| kiitos wrote:
| I'm not using "hallucinate" to mean "makes mistakes". I'm
| using it to mean "code that is syntactically correct and
| passes tests but is semantically incoherent". Which is
| the same thing that "hallucination" normally means in the
| context of a typical user LLM chat session.
| saagarjha wrote:
| My guy didn't you spend like half your life in the field
| where your job was to sift through code that compiled but
| nonetheless had bugs that you tried to exploit? How can
| you possibly have this belief about AI generated code?
| tptacek wrote:
| I don't understand this question. Yes, I spent about 20
| years learning the lesson that code is profoundly
| knowable; to start with, you just read it. What challenge
| do you believe AI-generated code presents to me?
| lcnPylGDnU4H9OF wrote:
| > You seem to have this fundamental belief that the code
| that's produced by your LLM is valid and doesn't need to
| be evaluated, line-by-line, by a human, before it can be
| committed??
|
| This is a mistaken understanding. The person you
| responded to has written on these thoughts already and
| they used memorable words in response to this proposal:
|
| > Are you a vibe coding Youtuber? Can you not read code?
| If so: astute point. Otherwise: what the fuck is wrong
| with you?
|
| It should be obvious that one would read and verify the
| code before they commit it. Especially if one works on a
| team.
|
| https://fly.io/blog/youre-all-nuts/
| kasey_junk wrote:
| We should go one step past this and come up with an
| industry practice where we get someone other than the
| author to read the code before we merge it.
| BoorishBears wrote:
| This is just people talking past each other.
|
| If you want a model that's getting better at helping you as a
| tool (which for the record, I do), then you'd say in the last
| 3 months things got better between Gemini's long context
| performance, the return of Claude Opus, etc.
|
| But if your goal post is replacing SWEs entirely... then it's
| not hard to argue we definitely didn't overcome any new
| foundational issues in the last 3 months, and not too many
| were solved in the last 3 years even.
|
| In the last year the only real _foundational_ breakthrough
| would be RL-based reasoning w/ test-time compute delivering
| real results, but what that does to hallucinations, plus even
| Deepseek catching up with just a few months of post-training,
| shows that in its current form the technique doesn't blow past
| the barriers the way people were originally touting it would.
|
| Overall models are getting better at things we can trivially
| post-train and synthesize examples for, but it doesn't feel
| like we're breaking unsolved problems at a substantially
| accelerated rate (yet.)
| atomlib wrote:
| https://xkcd.com/605/
| groby_b wrote:
| It is "inevitable" in the sense that in 99% of the cases,
| tomorrow is just like yesterday.
|
| LLMs have been continually improving for years now. The
| surprising thing would be them not improving further. And if
| you follow the research even remotely, you know they'll improve
| for a while, because not all of the breakthroughs have landed
| in commercial models yet.
|
| It's not "techno-utopian determinism". It's a clearly visible
| trajectory.
|
| Meanwhile, if they didn't improve, it wouldn't make a
| significant change to the overall observations. It's picking a
| minor nit.
|
| The observation that strict prompt adherence plus prompt
| archival could shift how we program is both true and a
| phenomenon we've observed several times in the past. Nobody
| keeps the assembly output from the compiler around anymore,
| either.
|
| There's definitely valid criticism of the passage, and it's
| overly optimistic - in that most non-trivial prompts are still
| underspecified and have multiple possible implementations, not
| all correct. That's both a more useful criticism, and not tied
| to LLM improvements at all.
| double0jimb0 wrote:
| Are there places that follow the research that speak to the
| layperson?
| sumedh wrote:
| More compute means faster processing and more context.
| its-kostya wrote:
| What is ironic: if we buy into the theory that AI will write
| the majority of the code in the next 5-10 years, what is it
| going to train on after? ITSELF? It seems this theoretical
| trajectory of "will inevitably get better" is only true if
| humans are producing quality training data. The quality of the
| code LLMs create is very much proportional to how mature and
| ubiquitous the languages/projects are.
| solarwindy wrote:
| I think you neatly summarise why the current pre-trained LLM
| paradigm is a dead end. If these models were really capable
| of artificial reasoning and _learning_, they wouldn't need
| more training data at all. If they could learn like a human
| junior does, and actually progress to being a senior, then I
| really could believe that we'll all be out of a job--but they
| just _do not_.
| SrslyJosh wrote:
| > Reading through these commits sparked an idea: what if we
| treated prompts as the actual source code? Imagine version
| control systems where you commit the prompts used to generate
| features rather than the resulting implementation.
|
| Please god, no, never do this. For one thing, why would you _not_
| commit the generated source code when storage is essentially
| free? That seems insane for multiple reasons.
|
| > When models inevitably improve, you could connect the latest
| version and regenerate the entire codebase with enhanced
| capability.
|
| How would you know if the code was better or worse if it was
| never committed? How do you audit for security vulnerabilities or
| debug with no source code?
| Sevii wrote:
| There are lots of reasons not to do it. But if LLMs get good
| enough that it works consistently people will do it anyway.
| minimaxir wrote:
| What will people call it when coders rely on vibes even more
| than vibe coding?
| roywiggins wrote:
| Haruspicy?
| brookst wrote:
| Writing specs
| auggierose wrote:
| Exactly my thought. This is just natural language as a
| specification language.
| kiitos wrote:
| ...as an ambiguous and inadequately-specified
| specification language.
| rectang wrote:
| >> _what if we treated prompts as the actual source code?_
|
| You would not do this because: unlike programming languages,
| natural languages are ambiguous and thus inadequate to fully
| specify software.
| a012 wrote:
| Prompts are like stories on the board, and just as with
| engineers, the generated source code can vary depending on
| the model's understanding. Saying the prompts could be the
| actual code is such a wrong and dangerous thought.
| squillion wrote:
| Exactly!
|
| > this assumes models can achieve strict prompt adherence
|
| What does strict adherence to an ambiguous prompt even mean?
| It's like those people asking Babbage if his machine would
| give the right answer when given the wrong figures. _I am not
| able rightly to apprehend the kind of confusion of ideas that
| could provoke such a proposition._
| tayo42 wrote:
| I'm pretty sure most people aren't doing "software engineering"
| when they program. There's the whole world of WordPress- and
| Dreamweaver-like programming out there too, where the
| consequences of messing up aren't really important.
|
| LLMs can be configured to have deterministic output too.
| fastball wrote:
| The idea as stated is a poor one, but a slight reshuffling and
| it seems promising:
|
| You generate code with LLMs. You write tests for this code,
| either using LLMs or on your own. You of course commit your
| actual code: it is required to actually run the program, after
| all. However you also save the entire prompt chain somewhere.
| Then (as stated in the article), when a much better model comes
| along, you re-run that chain, presumably with prompting like
| "create this project, focusing on efficiency" or "create this
| project in Rust" or "create this project, focusing on
| readability of the code". Then you run the tests against the
| new codebase and if the suite passes you carry on, with a much
| improved codebase. The theoretical benefit of this over just
| giving your previously generated code to the LLM and saying
| "improve the readability" is that the newer (better) LLM is not
| burdened by the context of the "worse" decisions made by the
| previous LLM.
|
| Obviously it's not actually that simple, as tests don't catch
| everything (tho with fuzz testing and complete coverage and
| such they can catch most issues), but we programmers often
| treat them as if they do, so it might still be a worthwhile
| endeavor.
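|
| (A toy sketch of that flow, assuming the prompt chain was saved
| as JSON; llm_generate and write_generated_files are hypothetical
| stand-ins for the model call and codegen tooling:)
|
|     import json
|     import subprocess
|
|     def llm_generate(model: str, prompt: str) -> str:
|         """Hypothetical: call the newer model of your choice."""
|         raise NotImplementedError
|
|     def write_generated_files(output: str) -> None:
|         """Hypothetical: materialize the model's output in the repo."""
|         raise NotImplementedError
|
|     def regenerate_and_check(chain_path: str, model: str) -> bool:
|         # Replay the archived prompt chain against the newer model.
|         with open(chain_path) as f:
|             for prompt in json.load(f):
|                 write_generated_files(llm_generate(model, prompt))
|         # Keep the regenerated codebase only if the old suite passes.
|         return subprocess.run(["pytest", "-q"]).returncode == 0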
| stingraycharles wrote:
| Means the temperature should be set to 0 (which not every
| provider supports) so that the output becomes entirely
| deterministic. Right now with most models if you give the
| same input prompt twice it will give two different solutions.
| NitpickLawyer wrote:
| Even at temp 0, you might get different answers, depending
| on your inference engine. There might be hardware
| differences, as well as software issues (e.g. vLLM
| documents this, if you're using batching, you might get
| different answers depending on where in the batch sequence
| your query landed).
| derwiki wrote:
| Two years ago when I was working on this at a startup,
| setting OAI models' temp to 0 still didn't make them
| deterministic. Has that changed?
| fastball wrote:
| I would only care about more deterministic output if I was
| repeating the same process with the same model, which is
| not the point of the exercise.
| weird-eye-issue wrote:
| Claude Code already uses a temperature of 0 (just inspect
| the requests) but it's not deterministic
|
| Not to mention it also performs web searches, web fetching
| etc which would also make it not deterministic
| afiori wrote:
| Do LLM inference engines have a way to seed their
| randomness, so as to have reproducible outputs with still
| some variance if desired?
| bavell wrote:
| Yes, although it's not always exposed to the end user of
| LLM providers.
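|
| Where it is exposed, it looks roughly like this (OpenAI-style
| chat API as one example; even with temperature=0 and a seed this
| is only best-effort, as the sibling comments note):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     resp = client.chat.completions.create(
|         model="gpt-4o",
|         messages=[{"role": "user", "content": "Write a bubble sort"}],
|         temperature=0,  # greedy-ish sampling
|         seed=42,        # best-effort reproducibility, not a guarantee
|     )
|     # system_fingerprint changes when the backend config changes,
|     # which is one reason identical requests can still diverge.
|     print(resp.system_fingerprint, resp.choices[0].message.content)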
| singhrac wrote:
| Production inference is not deterministic because of
| sharding (i.e. parameter weights on several GPUs on the
| same machine or MoE), timing-based kernel choices (e.g.
| torch.backends.cudnn.benchmark), or batched routing in
| MoEs. Probably best to host a small model yourself.
| maxemitchell wrote:
| Your rephrasing better captures my idea, and I should have
| emphasized in the post that I do _not_ think this is a good
| idea (nor possible) right now; it was more of a hand-wavy
| "how could we rethink source control in a post-LLM world"
| passing thought I had while reading through all the commits.
|
| Clearly it struck a chord with a lot of the folks here
| though, and it's awesome to read the discourse.
| renewiltord wrote:
| It's been a thing people have done for at least a year
| https://github.com/i365dev/LetterDrop
| gizmo686 wrote:
| My work has involved a project that is almost entirely
| generated code for over a decade. Not AI generated, the actual
| work of the project is in creating the code generator.
|
| One of the things we learned very quickly was that having
| generated source code in the same repository as actual source
| code was not sustainable. The nature of reviewing changes is
| just too different between them.
|
| Another thing we learned very quickly was that attempting to
| generate code, then modify the result is not sustainable; nor
| is aiming for a 100% generated code base. The end result of
| that was that we had to significantly rearchitect the project
| for us to essentially inject manually crafted code into
| arbitrary places in the generated code.
|
| Another thing we learned is that any change in the code
| generator needs to have a feature flag, because _someone_ was
| relying on the old behavior.
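|
| (For flavor, one common shape that injection can take; a toy
| Python sketch with hypothetical names, not our actual generator:
| the generated code looks up optional hand-written hooks and
| calls them at fixed extension points.)
|
|     # generated_user.py -- emitted by the generator, never hand-edited
|     import importlib
|
|     try:
|         _hooks = importlib.import_module("user_hooks")  # hand-written, optional
|     except ImportError:
|         _hooks = None
|
|     def save_user(user: dict) -> dict:
|         # Extension point: let hand-written code adjust the record first.
|         if _hooks and hasattr(_hooks, "before_save_user"):
|             user = _hooks.before_save_user(user)
|         # ... generated persistence logic would go here ...
|         return user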
| mschild wrote:
| > One of the things we learned very quickly was that having
| generated source code in the same repository as actual source
| code was not sustainable.
|
| Keeping a repository with the prompts, or other commands
| separate is fine, but not committing the generated code at
| all I find questionable at best.
| djtango wrote:
| I didn't read it as that - If I understood correctly,
| generated code must be quarantined very tightly. And
| inevitably you need to edit/override generated code and the
| manner by which you alter it must go through some kind of
| process so the alteration is auditable and can again be
| clearly distinguished from generated code.
|
| Tbh this all sounds very familiar and like classic data
| management/admin systems for regular businesses. The only
| difference is that the data is code and the admins are the
| engineers themselves so the temptation to "just" change
| things in place is too great. But I suspect it doesn't
| scale and is hard to manage etc.
| diggan wrote:
| If you can 100% reproduce the same generated code from the
| same prompts, even 5 years later, given the same versions
| and everything, then I'd say "Sure, go ahead and don't save
| the generated code, we can always regenerate it". As
| someone who spent some time in frontend development, we've
| been doing it like that for a long time with (MB+)
| generated code, keeping it in scm just isn't feasible long-
| term.
|
| But given this is about LLMs, which people tend to run with
| temperature>0, this is unlikely to be true, so then I'd
| really urge anyone to actually store the results
| (somewhere, maybe not in scm specifically) as otherwise you
| won't have any idea about what the code was in the future.
| overfeed wrote:
| > If you can 100% reproduce the same generated code from
| the same prompts, even 5 years later
|
| Reproducible builds with deterministic stacks and local
| compilers are far from solved. Throwing in LLM randomness
| just makes for a spicier environment to not commit the
| generated code.
| saagarjha wrote:
| I feel like using a compiler is in a sense a code generator
| where you don't commit the actual output
| mschild wrote:
| Sure, but compilers are arguably idempotent. Same code
| input, same output. LLMs certainly are not.
| saagarjha wrote:
| Yeah, I fully agree (in the other comments here, no less);
| I just think "I don't commit my code" reflects a specific
| mindset about what code actually is.
| lelanthran wrote:
| > I feel like using a compiler is in a sense a code
| generator where you don't commit the actual output
|
| Compilers are deterministic. Given the same input you
| always get the same output so there's no reason to store
| the output. If you don't get the same output we call it a
| compiler bug!
|
| LLMs do not work this way.
|
| (Aside: Am I the only one who feels that the entire AI
| industry is predicated on replacing only development
| positions? We're looking at, what, $100bn invested, with
| almost no reduction in customers' operating costs except
| where the customer has developers.)
| Atotalnoob wrote:
| LLMs CAN be deterministic. You can control the
| temperature to get the same output repeatedly.
|
| Although I don't really understand why you'd only want to
| store prompts...
|
| What if that model is no longer available?
| saagarjha wrote:
| They're typically not, since they typically rely on
| operators that aren't deterministic (e.g. atomics).
| cesarb wrote:
| > Compilers are deterministic. Given the same input you
| always get the same output
|
| Except when they aren't. See for instance
| https://gcc.gnu.org/onlinedocs/gcc-15.1.0/gcc/Developer-
| Opti... or the __DATE__/__TIME__ macros.
| lelanthran wrote:
| From the link:
|
| > You can use the -frandom-seed option to produce
| reproducibly identical object files.
|
| Deterministic.
|
| Also, with regard to __DATE__/__TIME__ macros, those are
| deterministic, because the current date and time are part
| of the inputs.
| tptacek wrote:
| Why does it matter to you if the code generator is
| deterministic? The _code_ is.
|
| If LLM generation was like a Makefile step, part of your
| build process, this concern would make a lot of sense.
| But nobody, anywhere, does that.
| cimi_ wrote:
| I will guess that you are generating orders of magnitude more
| lines of code with your software than people do when building
| projects with LLMs - if this is true I don't think the
| analogy holds.
| saagarjha wrote:
| I think the biggest difference here is that your code
| generator is probably deterministic and you likely are able
| to debug the results it produces rather than treating it like
| a black box.
| buu700 wrote:
| Overloading of the term "generate" is probably creating
| some confused ideas here. An LLM/agent is a lot more
| similar to a human in terms of its transformation of input
| into output than it is to a compiler or code generator.
|
| I've been working on a recent project with heavy use of AI
| (probably around 100 hours of long-running autonomous AI
| sprints over the last few weeks), and if you tried to re-
| run all of my prompts in order, even using the exact same
| models with the exact same tooling, it would almost
| certainly fall apart pretty quickly. After the first few, a
| huge portion of the remaining prompts would be referencing
| code that wouldn't exist and/or responding to things that
| wouldn't have been said in the AI's responses. Meta-
| prompting (prompting agents to prepare prompts for other
| agents) would be an interesting challenge to properly
| encode. And how would human code changes be represented, as
| patches against code that also wouldn't exist?
|
| The whole idea also ignores that AI being fast and cheap
| compared to human developers doesn't make it infinitely
| fast or free, or put it in the same league of quickness and
| cheapness as a compiler. Even if this were conceptually
| feasible, all it would really accomplish is making it so
| that any new release of a major software project takes
| weeks (or more) of build time and thousands of dollars (or
| more) burned on compute.
|
| It's an interesting thought experiment, but the way I would
| put it into practice would be to use tooling that includes
| all relevant prompts / chat logs in each commit message.
| Then maybe in the future an agent with a more advanced
| model could go through each commit in the history one by
| one, take notes on how each change could have been better
| implemented based on the associated commit message and any
| source prompts contained therein, use those notes to inform
| a consolidated set of recommended changes to the current
| code, and then actually apply the recommendations in a
| series of pull requests.
| tptacek wrote:
| People keep saying this and it doesn't make sense. I review
| code. I don't construct a theory of mind of the author of
| the code. With AI-generated code, if it isn't eminently
| reviewable, I reflexively kill the PR and either try again
| or change the tasking.
|
| There's always this vibe that, like, AI code is like an
| IOCCC puzzle. No. It's extremely boring mid-code. Any
| competent developer can review it.
| buu700 wrote:
| I assumed they were describing AI itself as a black box
| (contrasting it with deterministic code generation), not
| the output of AI.
| tptacek wrote:
| Right, I get that, and an LLM call by itself clearly is a
| black box. I just don't get why that's supposed to
| matter. It produces an artifact I can (and must) verify
| myself.
| buu700 wrote:
| Because if the LLM is a black box and its output must
| ultimately be verified by humans, then you can't treat
| conversion of prompts into code as a simple build step as
| though an AI agent were just some sort of compiler. You
| still need to persist the actual code in source control.
| skywhopper wrote:
| There's a huge difference between deterministic generated
| code and LLM generated code. The latter will be different
| every time, sometimes significantly so. Subsequent prompts
| would almost immediately be useless. "You did X, but we want
| Y" would just blow up if the next time through the LLM (or
| the new model you're trying) doesn't produce X at all.
| overfeed wrote:
| > One of the things we learned very quickly was that having
| generated source code in the same repository as actual source
| code was not sustainable
|
| My rule of thumb is to have both in the same repo, but treat
| generated code like binary data. This was informed by a time I
| was burned by a tooling regression that broke the generated
| code, and the investigation was complicated by having to
| correlate commits across different repositories.
| dkubb wrote:
| I love having generated code in the same repo as the
| generator because with every commit I can regenerate the
| code and compare it to make sure it stays in sync. Then it
| forms something similar to a golden tests where if
| something unexpected changes it gets noticed on review.
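|
| (A sketch of that kind of check as a CI step; generate.py and
| the generated/ directory are hypothetical stand-ins for the real
| generator and its committed output:)
|
|     # check_generated.py: fail CI if the committed output drifts
|     import subprocess
|     import sys
|     import tempfile
|
|     with tempfile.TemporaryDirectory() as tmp:
|         # Regenerate into a scratch directory.
|         subprocess.run(["python", "generate.py", "--out", tmp], check=True)
|         # Non-zero exit from diff -r means the committed tree drifted.
|         drifted = subprocess.run(["diff", "-r", "generated/", tmp]).returncode
|         sys.exit(1 if drifted else 0)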
| david-gpu wrote:
| Please tell us which company you are working for so that we
| don't send our resumes there.
|
| Jokes aside, I have worked on projects where auto-generating
| code was the chosen solution, and it's always been
| 100% auto-generated, essentially at compilation time. Any
| hand-coded stuff needed to handle corner cases or glue pieces
| together was kept outside of the code generator.
| mellosouls wrote:
| Yes, it's too early to be doing that now, but if you see the
| move to AI-assisted code as _at least_ the same magnitude of
| change as the move from assembly to high level languages, the
| argument makes more sense.
|
| Nobody commits the compiled code; this is the direction we are
| moving in: high-level source code is the new assembly.
| Xelbair wrote:
| Worse: models aren't deterministic! They use a temperature
| value to control randomness, just so they can escape local
| minima!
|
| Regenerated code might behave differently, have different
| bugs (worst case), or not work at all (best case).
| chrishare wrote:
| Nitpick - it's the ML system that is sampling from model
| predictions that has a temperature parameter, not the model
| itself. Temperature and even model aside, there are other
| sources of randomness like the underlying hardware that can
| cause the havoc you describe.
| never_inline wrote:
| Apart from obvious non-reproducibility, the other problem is
| lack of navigable structure. I can't command+click or "show
| usages" or "show definition" any more.
| saagarjha wrote:
| Just ask the AI for those obviously
| visarga wrote:
| The idea is good, but we should commit both documentation and
| tests. They allow regenerating the code at will.
| pollinations wrote:
| I'd say commit a comprehensive testing system with the prompts.
|
| Prompts are in a sense what higher-level programming languages
| were to assembly. Sure, there is a crucial difference, which is
| reproducibility. I could try and write down my thoughts on why
| I think in the long run it won't be so problematic. I could be
| wrong, of course.
|
| I run https://pollinations.ai which serves over 4 million
| monthly active users quite reliably. It is mostly coded with
| AI. For about a year there has been no significant human
| commit. You can check the codebase. It's messy, but not more
| messy than my codebases were pre-LLMs.
|
| I think prompts + tests in code will be the medium-term
| solution. Humans will be spending more time testing different
| architecture ideas and be involved in reviewing and larger
| changes that involve significant changes to the tests.
| maxemitchell wrote:
| Agreed with the medium-term solution. I wish I put some more
| detail into that part of the post, I have more thoughts on it
| but didn't want to stray too far off topic.
| 7speter wrote:
| I think the author is saying you commit the prompt with the
| resulting code. You said it yourself, storage is free, so
| include the prompt as a comment alongside the output (the
| prompt in addition to the code, not instead of it, if I'm not
| being clear); it would show the developer's intent and, to
| some degree, almost always contribute to the documentation
| process.
| maxemitchell wrote:
| Author here :). Right now, I think the pragmatic thing to do
| is to include all prompts used in either the PR description
| and/or in the commit description. This wouldn't make my
| longshot idea of "regenerating a repo from the ground up"
| possible, but it still adds very helpful context to code
| reviewers and can help others on your team learn prompting
| techniques.
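|
| For example, a commit message might carry the prompt like this
| (just one possible convention, invented for illustration, not
| something the post prescribes):
|
|     Add rate limiting to the token endpoint
|
|     Prompt: "Add per-client rate limiting to the /token endpoint.
|     Return 429 with a Retry-After header when the limit is hit.
|     Keep the existing storage interface unchanged."
|
|     Tool/model: <agent and model that produced the change>
|     Human edits: renamed helpers, tightened error messages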
| kace91 wrote:
| Plus, commits depend on the current state of the system.
|
| What sense does "getting rid of vulnerabilities by phasing out
| {dependency}" make, if the next generation of the code might
| not rely on the mentioned library at all? What does "improve
| performance of {method}" mean if the next generation used a
| fully different implementation?
|
| It makes no sense whatsoever except for a vibecoder's script
| that's being extrapolated into a codebase.
| croes wrote:
| You couldn't even tell in advance if the prompt produces code
| at all.
| js2 wrote:
| Discussion from 4 days ago when the code was announced (846
| points, 519 comments):
|
| https://news.ycombinator.com/item?id=44159166
| viraptor wrote:
| The documentation angle is really good. I've noticed it with the
| mdc files and llm.txt semi-standard. Documentation is often
| treated as just extra cost and a chore. Now, a good description
| of the project structure and good examples suddenly become
| something devs want ahead of time. Even if the reason is not
| perfect, I appreciate this shift we'll all benefit from.
| IncreasePosts wrote:
| I asked this in the other thread (no response, but I was a bit
| late)
|
| How does anyone using AI like this have confidence that they
| aren't unintentionally plagiarizing code and violating the terms
| of whatever license it was released under?
|
| For random personal projects I don't see it mattering that much.
| But if a large corp is releasing code like this, one would hope
| they've done some due diligence that they haven't just stolen
| the code from some similar repo on GitHub, laundered through an
| LLM.
|
| The only relevant section in the readme doesn't mention checking
| similar projects or libraries for common code:
|
| > Every line was thoroughly reviewed and cross-referenced with
| relevant RFCs, by security experts with previous experience with
| those RFCs.
| saghm wrote:
| Safety in the shadow of giant tech companies. People were upset
| when Microsoft released Copilot trained on GitHub data, but
| nobody who cared could do anything about it, and nobody who
| could have done something about it cared, so it just became the
| new norm.
| throwawaysleep wrote:
| As an individual dev, I simply don't care. Not my problem.
|
| Companies are satisfied with the indemnity provided by
| Microsoft.
| akdev1l wrote:
| > How does anyone using AI like this have confidence that they
| aren't unintentionally plagiarizing code and violating the
| terms of whatever license it was released under?
|
| They don't and no one cares
| ryandrake wrote:
| This is an excellent question that the AI-boosters always seem
| to dance around. Three replies already are saying "Nobody
| cares." Until they do. I'd be willing to bet that some time in
| the near future, some big company is going to care _a lot_ and
| that there will be a landmark lawsuit that significantly
| changes the LLM landscape. Regulation or a judge is going to
| eventually decide the extent to which someone can use AI to
| copy someone else's IP, and it's not going to be pretty.
| SpicyLemonZest wrote:
| It just presumes a level of fixation in copyright law that I
| don't think is realistic. There was a landmark lawsuit MAI v.
| Peak Computer in 1993, where judges determined that repairing
| a computer without the permission of the operating system's
| author is copyright infringement, and it didn't change the
| landscape at all because everyone immediately realized it's
| not practical for things to work that way. There's no
| realistic world where AI tools end up being extremely useful
| but nobody uses them because of a court ruling.
| tptacek wrote:
| Most of the code generated by LLMs, and _especially_ the code
| you actually keep from an agent, is mid, replacement-level,
| boring stuff. If you're not already building projects with
| LLMs, I think you need to start doing that first before you
| develop a strong take on this. From what I see in my own work,
| the code being generated is highly unlikely to be
| distinguishable. There is more of me and my prompts and
| decisions in the LLM code than there can possibly be defensible
| IPR from anybody else, unless the very notion of, like,
| wrapping a SQLite INSERT statement in Golang is defensible.
|
| The best way I can explain the experience of working with an
| LLM agent right now is that it is like if every API in the
| world had a magic "examples" generator that always included
| whatever it was you were trying to do (so long as what you were
| trying to do was within the obvious remit of the library).
| aryehof wrote:
| The consensus, right or wrong, is that LLM-produced code
| (unless repeated verbatim) is equivalent to you or me
| legitimately stating our novel understanding of mixed sources,
| some of which may be copyrighted.
| simonw wrote:
| All of the big LLM vendors have a "copyright shield" indemnity
| clause for their paying customers - a guarantee that if you get
| sued over IP for output from their models their legal team will
| step in to fight on your behalf.
| kentonv wrote:
| I'm fairly confident that it's not just plagiarizing because I
| asked the LLM to implement a novel interface with unusual
| semantics. I then prompted for many specific fine-grain changes
| to implement features the way I wanted. It seems entirely
| implausible to me that there could exist prior art that
| happened to be structured exactly the way I requested.
|
| Note that I came into this project believing that LLMs were
| plagiarism engines -- I was looking for that! I ended up
| concluding that this view was not consistent with the output I
| was actually seeing.
| cavisne wrote:
| Some APIs (Gemini at least) run a search on their outputs to
| see if the model is reciting data from training.
|
| So for direct copies like what you are talking about that would
| be picked up.
|
| For copying concepts from other libraries, that seems like a
| problem with or without LLMs.
| drodgers wrote:
| > Prompts as Source Code
|
| Another way to phrase this is LLM-as-compiler and Python (or
| whatever) as an intermediate compiler artefact.
|
| Finally, a true 6th generation programming language!
|
| I've considered building a toy of this with really aggressive
| modularisation of the output code (eg. python) and a query-based
| caching system so that each module of code output only changes
| when the relevant part of the prompt or upsteam modules change
| (the generated code would be committed to source control like a
| lockfile).
|
| I think that (+ some sort of WASM-encapsulated execution
| environment) would be one of the best ways to write one-off
| things like scripts, which _don't_ need to incrementally get
| better and more robust over time in the way that ordinary code
| does.
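|
| (A toy sketch of the caching half, with a hypothetical lockfile
| name; the model call itself is elided:)
|
|     import hashlib
|     import json
|     import pathlib
|
|     LOCK = pathlib.Path("prompts.lock.json")
|
|     def cache_key(prompt: str, upstream_sources: list[str]) -> str:
|         # The key covers the prompt section and the generated
|         # upstream modules it depends on.
|         h = hashlib.sha256(prompt.encode())
|         for src in upstream_sources:
|             h.update(pathlib.Path(src).read_bytes())
|         return h.hexdigest()
|
|     def module_up_to_date(name: str, prompt: str,
|                           upstream: list[str]) -> bool:
|         lock = json.loads(LOCK.read_text()) if LOCK.exists() else {}
|         return lock.get(name) == cache_key(prompt, upstream)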
| sumedh wrote:
| > Finally, a true 6th generation programming language!
|
| Karpathy already said English is the new programming language.
| declan_roberts wrote:
| These posts are funny to me because prompt engineers point at
| them as evidence of the fast-approaching software engineer
| obsolescence, but the amount of software engineering experience
| necessary to even guide an AI in this way is very high.
|
| The reason he keeps adjusting the prompts is because he knows how
| to program. He knows what it should look like.
|
| It just blurs the line between engineer and tool.
| tptacek wrote:
| I don't know why that's funny. This is not a post about a vibe
| coding session. It's Kenton Varda['s coding session].
|
| _later_
|
| _updated to clarify kentonv didn't write this article_
| kevingadd wrote:
| I think it makes sense that GP is skeptical of this article
| considering it contains things like:
|
| > this tool is improving itself, learning from every
| interaction
|
| which seem to indicate a fundamental misunderstanding of how
| modern LLMs work: the 'improving' happens by humans
| training/refining existing models offline to create new
| models, and the 'learning' is just filling the context window
| with more stuff, not enhancement of the actual model or the
| model 'learning' - it will forget everything if you drop the
| context and as the context grows it can 'forget' things it
| previously 'learned'.
| BurritoKing wrote:
| When you consider the "tool" as more than just the LLM
| model, but the stuff wrapped around calling that model then
| I feel like you can make a good argument it's improving
| when it keeps context in a file on disk and constantly
| updates and edits that file as you work throguh the
| project.
|
| I do this routinely for large initiatives I'm kicking off
| through Claude Code - it writes a long detailed plan into a
| file and as we work through the project I have it
| constantly updating and rewriting that document to add
| information we have jointly discovered from each bit of the
| work. That means every time I come back and fire it back
| up, it's got more information than when it started, which
| looks a lot like improvement from my perspective.
| tptacek wrote:
| I would love to hear more about this workflow.
| kiitos wrote:
| The sequence of commits talked about by the OP -- i.e.
| kenton's coding session's commits -- are like one degree
| removed from 100% pure vibe coding.
| tptacek wrote:
| Your claim here being that Kenton Varda isn't reading the
| code he's generating. Got it. Good note.
| kiitos wrote:
| No, that's not at all my claim, as it's obvious from the
| commit history that Kenton is reading the code he's
| generating before committing it.
| kentonv wrote:
| What do you mean by "one degree removed from 100% pure
| vibe coding", then? The definition of vibe coding is
| letting the AI code without review...
| kiitos wrote:
| > one degree removed
|
| You're letting Claude do your programming for you, and
| then sweeping up whatever it does afterwards. Bluntly,
| you're off-loading your cognition to the machine. If
| that's fine by you then that's fine enough, it just means
| that the quality of your work becomes a function of your
| tooling rather than your capabilities.
| kentonv wrote:
| I don't agree. The AI largely does the boring and obvious
| parts. I'm still deciding what gets built and how it is
| designed, which is the interesting part.
| tptacek wrote:
| It's the same with me, with the added wrinkle of pulling
| each PR branch down and refactoring things (and,
| ironically, introducing my own bugs).
| kiitos wrote:
| > I'm still deciding what gets built and how it is
| designed, which is the interesting part.
|
| How, exactly? Do you think that you're "deciding what
| gets built and how it's designed" by iterating on the
| prompts that you feed to the LLM that generates the code?
|
| Or are you saying that you're somehow able to write the
| "interesting" code, and can instruct the LLM to generate
| the "boring and obvious" code that needs to be filled-in
| to make your interesting code work? (This is certainly
| not what's indicated by your commit history, but, who
| knows?)
| spaceman_2020 wrote:
| The argument is that this stuff will so radically improve
| senior engineer productivity that the demand for junior
| engineers will crater. And without a pipeline of junior
| engineers, the junior-to-senior trajectory will radically
| atrophy
|
| Essentially, the field will get frozen where existing senior
| engineers will be able to utilize AI to outship traditional
| senior-junior teams, even as junior engineers fail to secure
| employment
|
| I don't think anything in this article counters this argument
| tptacek wrote:
| I don't know why people don't give more credence to the
| argument that the exact opposite thing will happen.
| dcre wrote:
| Right. I don't understand why everyone thinks this will
| make it impossible for junior devs to learn. The people I
| had around to answer my questions when I was learning knew
| a whole lot less than Claude and also had full time jobs
| doing something other than answering my questions.
| fch42 wrote:
| It won't make it impossible for junior engineers to
| learn.
|
| It will simply reduce the number of opportunities to
| learn (and not just for juniors), by virtue of companies'
| beancounters concluding "two for one" (several juniors)
| doesn't return the same as "buy one get one free"
| (existing staff + AI license).
|
| I dread the day we all "learn from AI". The social
| interaction part of learning is just as important as the
| content of it, really, especially when you're young; none
| of that comes across yet in the pure "1:1 interaction"
| with AI.
| auggierose wrote:
| I learnt programming on my own, without any social
| interaction involved. In fact, I loved programming
| because it does not involve any social interaction.
|
| Programming has become more of a "social game" in the
| last 15 years or so. AI is a new superpower for people
| like me, bringing balance to the Force.
| delegate wrote:
| You learn by doing, e.g. typing the code. It's not just
| knowledge, it's the intuition you develop when you write
| code yourself. Just like physical exercise. Or playing an
| instrument. It's not enough to know the theory, practice
| is key.
|
| AI makes it very easy to avoid typing and hence make
| learning this skill less attractive.
|
| But I don't necessarily see it as doom and gloom, what I
| think will happen - juniors will develop advanced
| intuition about using AI and getting the functionality
| they need, not the quality of the code, while at the same
| time the AI models will get increasingly better and write
| higher quality code.
| Ataraxic wrote:
| Junior devs using AI can get a lot better at using AI and
| learn the existing patterns it generates, but I notice,
| for myself, that if I let AI write a lot of the code, I
| remember it less well and thereby understand it less later on.
| This applies in school and when trying to learn new
| things but the act of writing down the solution and
| working out the details yourself trains our own brain.
| I'd say that this has been a practice for over a thousand
| years and I'm skeptical that this will make junior devs
| grow their own skills faster.
|
| I think asking questions to the AI for your own
| understanding totally makes sense, but there is a benefit
| when you actually create the code versus asking the AI to
| do it.
| tptacek wrote:
| I'm sure there is when you're just getting your sea legs
| in some environment, but at some point most of the code
| you write in a given environment is rote. Rote code is
| both depleting and mutagenic --- if you're fluent and
| also interested in programming, you'll start convincing
| yourself to do stupid stuff to make the code less rote
| ("DRY it up", "make a DSL", &c) that makes your code less
| readable and maintainable. It's a trap I fall into
| constantly.
| kiitos wrote:
| > but at some point most of the code you write in a given
| environment is rote
|
| "Most of the code one writes in a given environment is
| rote" is true in the same sense that most of the words
| one writes in any given bit of text are rote e.g.
| conjunctions, articles, prepositions, etc.
| tptacek wrote:
| Some writers I know are convinced this is true, but I
| still don't think the comparison is completely apt,
| because _deliberately_ rote code with _modulated_
| expressiveness is often (even usually) a virtue in
| coding, and not always so with writing. For experienced
| or enthusiastic coders, that is to say, the effort is
| often in not doing stupid stuff to make the code more
| clever.
|
| Straight-line replacement-grade mid code that just does
| the things a prompt tells it to in the least clever most
| straightforward way possible is usually a good thing;
| that long clunky string of modifiers goes by the name
| "maintainability".
| spaceman_2020 wrote:
| If a junior engineer ships a similar repo to this with the
| help of AI, sure, I'll buy that.
|
| But as of now, it's senior engineers who really know what
| they're doing who can spot the errors in AI code.
| tptacek wrote:
| Hold on. You said "really know what they're doing". Yes,
| I agree with that. What I don't buy is the coupling of
| that concept with "seniority".
| danielbln wrote:
| Have a better term for "knows what they're doing" other
| than senior?
| tptacek wrote:
| That's not what "senior" means.
| danielbln wrote:
| Maybe you could enlighten the rest of us then. According
| to your favorite definition, what does senior mean, what
| does seniority mean, and what's a term for someone who
| knows what they're doing?
| tptacek wrote:
| Seniority means you've held the role for a long time.
| etothet wrote:
| This is not necessarily true in practical terms when it
| comes to hiring or promoting. Often a senior dev becomes a
| senior because of an advanced skillset, regardless of years
| on the job. Similarly, developers who have been on the job
| for many years often aren't ready for senior because they
| lack soft and hard skills.
| tptacek wrote:
| Oh, that's _one_ of the ways a senior dev becomes senior.
| latexr wrote:
| > It just blurs the line between engineer and tool.
|
| I realise you meant it as "the engineer and their tool blend
| together", but I read it like a funny insult: "that guy likes
| to think of himself as an engineer, but he's a complete tool".
| visarga wrote:
| > prompt engineers point at them as evidence of the fast-
| approaching software engineer obsolescence
|
| Maybe journalists and bloggers angling for attention do it;
| prompt engineers are too aware of the limitations of prompting
| to do that.
| thegrim33 wrote:
| I mean yeah, the very first prompt given to the AI was put
| together by an experienced developer; a bunch of code telling
| the AI exactly what the API should look like and how it would
| be used. The very first step in the process already required an
| experienced developer to be involved.
| thorum wrote:
| Humorous that this article has a strong AI writing smell - the
| author should publish the prompts they used!
| dcre wrote:
| I don't like to accuse, and the article is fine overall, but
| this stinks: "This transparency transforms git history from a
| record of changes into a record of intent, creating a new form
| of documentation that bridges human reasoning and machine
| implementation."
| keybored wrote:
| > I don't like to accuse, and the article is fine overall,
| but this stinks:
|
| Now consider your reasonable instinct not to accuse other
| people, coupled with the possibility of setting AI loose
| with "write a positive article about AI where you have some
| paragraphs about the current limitations based on this
| link. write like you are just following the evidence."
| Meanwhile we are supposed to sit here and weigh every word.
|
| This reminds me to write a prompt for a blog post: how AI
| could be used for making those personal-looking
| tech-guy-who-meditates-and-runs websites. (Do we have the
| technology? Yes we do.)
| ZephyrBlu wrote:
| Also: " _This OAuth library represents something larger than
| a technical milestone--it 's evidence of a new creative
| dynamic emerging_"
|
| Em-dash baby.
| latexr wrote:
| Can we please stop using the em-dash as a metric to
| "detect" LLM writing? It's lazy and wrong. Plenty of people
| use em-dashes, _it's a useful punctuation mark_. If humans
| didn't use them, they wouldn't be in the LLM training data.
|
| There are better clues, like the kind of vague pretentious
| babble bad marketers use to make their products and ideas
| seem more profound than they are. It's a type of bad
| writing which looks grandiose but is ultimately meaningless
| and that LLMs heavily pick up on.
| grey-area wrote:
| _Very_ few people use n dashes in internet writing as
| opposed to dashes as they are not available on the
| default keyboard.
| latexr wrote:
| That's not true at all. Apple's OSes have smart punctuation
| enabled by default and convert -- (two hyphens) into —
| (an "em-dash"; _not_ an "en-dash", which has a different
| purpose), "straight" (dumb) quotes into “curly” (smart)
| quotes, and so forth.
|
| Furthermore, on macOS there are simple key combinations
| (e.g. with ⌥) to make all sorts of smart punctuation
| even if you don't have the feature enabled by default,
| and on iOS you can long press on a key (such as the
| hyphen) to see alternates.
|
| The majority of people may not use correct punctuation
| marks, but enough do that assuming a single character
| immediately means they used an LLM is just plain wrong. I
| have _never_ used an LLM to write a blog post, internet
| comment, or anything of the sort, and I have used smart
| punctuation in all my writing for over a decade. Same
| with plenty of other HN commenters, journalists, writers,
| editors, and on and on. You don't need to be a literal
| machine to care about correct character use.
| grey-area wrote:
| So we've established the default is a hyphen, not an em
| dash.
|
| You can certainly select an em dash but most don't know
| what it means and don't use it.
|
| It's certainly not infallible proof but multiple uses of
| it in comments online (vs published material or
| newspapers) are very unusual, so I think it's an
| interesting indicator. I completely agree it is common in
| some texts, usually ones from publishing houses with
| style guides but also people who know about writing or
| typography.
| thoroughburro wrote:
| On the "default keyboard" of most people (a phone), you
| just long-press hyphen to choose any dash length.
| grey-area wrote:
| But who does? Not many.
| purplesyringa wrote:
| This is a post _with formatting_ and we're programmers
| here. I can assure you their editor (or Markdown)
| supports em-dash in some fashion.
| ZeroTalent wrote:
| I have used Em-dashes in many of my comments for years.
| It's just a result of reading books, where Em-dashes happen
| a lot.
| mpalmer wrote:
| The sentence itself is a smeLLM. Grandiose pronouncements
| aren't a bot exclusive, but man do they love making them,
| especially about creative paradigms and dynamics
| OjotCewIo wrote:
| > this stinks: "This transparency transforms git history from
| a record of changes into a record of intent, creating a new
| form of documentation that bridges human reasoning and
| machine implementation."
|
| That's where I stopped reading. If they needed "AI" for
| turning their git history into a record of intent
| ("transparency"), then they had been doing it all wrong,
| previously. Git commit messages have _always_ been a "form
| of documentation that bridges human reasoning" -- namely,
| with another human's (the reader's) reasoning.
|
| If you don't walk your reviewer through your patch, in your
| commit message, as if you were _teaching_ them, then you're
| doing it wrong.
|
| Left a bad taste in my mouth.
| maxemitchell wrote:
| I did human notes -> had Claude condense and edit -> manually
| edit. A few of the sentences (like the stinky one below) were
| from Claude which I kept if it matched my own thoughts, though
| most were changed for style/prose.
|
| I'm still experimenting with it. I find it can't match style at
| all, and even with the manual editing it still "smells like AI"
| as you picked up. But, it also saves time.
|
| My prompt was essentially "here are my old blog posts, here's
| my notes on reading a bunch of AI generated commits, help me
| condense this into a coherent article about the insights I
| learned"
| fpgaminer wrote:
| I used almost 100% AI to build a SCUMM-like parser, interpreter,
| and engine (https://github.com/fpgaminer/scumm-rust). It was a
| fun workflow; I could generally focus on my usual work and just
| pop in occasionally to check on and direct the AI.
|
| I used a combination of OpenAI's online Codex, and Claude Sonnet
| 4 in VSCode agent mode. It was nice that Codex was more automated
| and had an environment it could work in, but its thought-logs are
| terrible. Iteration was also slow because it takes a while for it
| to spin the environment up. And while you _can_ have multiple
| requests running at once, it usually doesn't make sense for a
| single, somewhat small project.
|
| Sonnet 4's thoughts were much more coherent, and it was fun to
| watch it work and figure out problems. But there's something
| broken in VSCode right now that makes its ability to read console
| output inconsistent, which made things difficult.
|
| The biggest issue I ran into is that both are set up to seek out
| and read only small parts of the code. While they're generally
| good at getting enough context, it does cause some degradation in
| quality. A frequent issue was replication of CSS styling between
| the Rust side of things (which creates all of the HTML elements)
| and the style.css side of things. Like it would be working on the
| Rust code and forget to check style.css, so it would just
| manually insert styles on the Rust side even though those
| elements were already styled on the style.css side.
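|
| (A minimal sketch of that failure mode, with invented names
| rather than the actual scumm-rust code: the class already
| exists in style.css, but the model restyles the element
| inline from the Rust side.)
|
|     use wasm_bindgen::JsValue;
|     use web_sys::{window, Element};
|
|     // style.css already defines:
|     //   .verb-button { color: #9cf; }
|     fn make_verb_button() -> Result<Element, JsValue> {
|         let doc = window().unwrap().document().unwrap();
|         let button = doc.create_element("div")?;
|         // What the model tends to do after reading only the
|         // Rust side: re-create the styling inline.
|         button.set_attribute("style", "color: #9cf;")?;
|         // What it should do: reuse the existing class.
|         // button.set_class_name("verb-button");
|         Ok(button)
|     }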
|
| Codex is also _terrible_ at formatting and will frequently muck
| things up, so it's mandatory to pair it with an autoformatter
| and instruct it to run that tool. Even with that, Codex will
| often say that it ran the formatter but didn't actually run it
| (or ran it somewhere in the middle instead of at the end), so
| its pull requests fail CI.
| Sonnet never seemed to have this issue and just used the
| prevailing style it saw in the files.
|
| Now, when I say "almost 100% AI", it's maybe 99% because I did
| have to step in and do some edits myself for things that both
| failed at. In particular neither can see the actual game running,
| so they'd make weird mistakes with the design. (Yes, Sonnet in VS
| Code can see attached images, and potentially can see the DOM of
| vscode's built in browser, but the vision of all SOTA models is
| ass so it's effectively useless). I also stepped in once to do
| one major refactor. The AIs had decided on a very strange, messy,
| and buggy interpreter implementation at first.
| eviks wrote:
| > Imagine version control systems where you commit the prompts
| used to generate features rather than the resulting
| implementation.
|
| So every single run will result in a different,
| non-reproducible implementation with unique bugs requiring
| manual expert intervention. How is this better?
| cosmok wrote:
| I have documented my experience using an agent for a slightly
| different task -- upgrading a framework version. I had to
| abandon the work, but my learnings have been similar to what
| is in the post.
|
| https://www.trk7.com/blog/ai-agents-for-coding-promise-vs-re...
| never_inline wrote:
| > Don't be afraid to get your hands dirty. Some bugs and styling
| issues are faster to fix manually than to prompt through. Knowing
| when to intervene is part of the craft.
|
| This has been my experience as well. It helps to always run
| the CLI tool in the bottom pane of an IDE and not in a
| standalone terminal.
| brador wrote:
| Many of you are failing to comprehend the potential scale of
| AI-generated codebases.
|
| Take note - there is no limit. Every feature you or the AI can
| prompt can be generated.
|
| Imagine if you were immortal and given unlimited storage. Imagine
| what you could create.
|
| That's a prompt away.
|
| Even now you're still restricting your thinking to the old ways.
| latexr wrote:
| You're sounding like a religious zealot recruiting for a cult.
|
| No, it is not possible to prompt every feature, and I suspect
| people who believe LLMs can accurately program anything in any
| language are frankly not solving any truly novel or interesting
| problems, because if they were they'd see the obvious cracks.
| nojito wrote:
| > I suspect people who believe LLMs can accurately program
| anything in any language are frankly not solving any truly
| novel or interesting problems, because if they were they'd
| see the obvious cracks.
|
| The vast majority of problems in programming aren't novel or
| interesting.
| politelemon wrote:
| > That's a prompt away.
|
| Currently, it's 6 prompts away in which 5 of those are me
| guiding the LLM to output the answer that I already have in
| mind.
| _lex wrote:
| You're talking ahead of the others in this thread, who do not
| understand how you got to what you're saying. I've been doing
| research in this area. You are not only correct, but the
| implications are staggering, and go further than what you have
| mentioned above. This is no cult, it is the reorganization of
| the economics of work.
| OjotCewIo wrote:
| > it is the reorganization of the economics of work
|
| and the overwhelming majority of humanity will be worse off
| for it
| UltraSane wrote:
| I was thinking that if you had a good enough verified
| mathematical model of your code using TLA+ or similar, you
| could then use an LLM to generate your code in any language
| and be confident it is correct. This would be Declarative
| Programming: instead of putting in a lot of work writing code
| that MIGHT do what you intend, you put more work into creating
| the verified model, and then the LLM generates code that will
| do what the model intends.
| kookamamie wrote:
| > Treat prompts as version-controlled assets
|
| This only works if the model and its context are immutable. None
| of us really control the models we use, so I'd be sceptical about
| reproducing the artifacts later.
| lmeyerov wrote:
| If/when to commit prompts has been a fascinating question, as
| we have been doing something similar to build Louie.ai. I now
| have several categories with different handling:
|
| - Human reviewed: Code guidelines and prompt templates are
| essentially dev tool infra-as-code and need review
|
| - Discarded: Individual prompt commands I write, and the
| implementation plan/progress files the AI writes, both get
| trashed, and are even part of my .gitignore (sketch below).
| They were kept by Cloudflare, but we don't keep these.
|
| - Unreviewed: Claude Code does not do RAG in the usual sense, so
| it is on us to create guides for how we do things like use big
| frameworks. They are basically indexes for speeding up AI with
| less grepping + hallucinating across memory compactions. The AI
| reads and writes these, and we largely stay out of it.
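|
| (Roughly what that discarded bucket looks like in the
| .gitignore -- these entries are illustrative, not our exact
| paths:)
|
|     # AI scratch: one-off prompts and plan/progress notes
|     ai-scratch/
|     *.plan.md
|     *.claude-progress.md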
|
| There are weird cases I am still trying to figure out. Ex:
|
| - a feature impl might start with the AI coming up with the
| product spec, so having that spec maintained as the AI
| progresses, and committed, is a potentially useful artifact
|
| - knowing how prompt templates get used is helpful for
| maintaining them automatically.
| Fischgericht wrote:
| So, it means that you and the LLM together have managed to write
| SEVEN lines of trivial code per hour. On a protocol that is
| perfectly documented, where you can look at about one million
| other implementations when in doubt.
|
| It is not my intention to hurt your feelings, but it sounds like
| you and/or the LLM are not really good at their job. Looking at
| programmer salaries and LLM energy costs, this appears to be a
| very very VERY expensive OAuth library.
|
| Again: Not my intention to hurt any feelings, but the numbers
| really are shockingly bad.
| Fischgericht wrote:
| Yes, my brain got confused on who wrote the code and who just
| reported about it. I am truly sorry. I will go see my LLM
| doctor to get my brain repaired.
| kentonv wrote:
| I spent about 5 days semi-focused on this codebase (though I
| always have lots of people interrupting me all the time). It's
| about 5000 lines (if you count comments, tests, and
| documentation, which you should). Where do you get 7 lines per
| hour?
| nojito wrote:
| >So, it means that you and the LLM together have managed to
| write SEVEN lines of trivial code per hour.
|
| Here's their response
|
| >It took me a few days to build the library with AI.
|
| >I estimate it would have taken a few weeks, maybe months to
| write by hand.
|
| >That said, this is a pretty ideal use case: implementing a
| well-known standard on a well-known platform with a clear API
| spec.
|
| https://news.ycombinator.com/item?id=44160208
|
| Lines of code per hour is a terrible metric to use.
| Additionally, it's far easier to critique code that's already
| written!
| moron4hire wrote:
| I'm sorry, this all sounds like a fucking miserable experience.
| Like, if this is what my job becomes, I'll probably quit tech
| completely.
| kentonv wrote:
| That's exactly what I thought, too, before I tried it!
|
| Turns out it feels very different than I expected. I really
| recommend trying it rather than assuming. There's no learning
| curve, you just install Claude Code and run it in your repo and
| ask it for things.
|
| (I am the author of the code being discussed. Or, uh, the
| author of the prompts at least.)
| Arainach wrote:
| >Around the 40-commit mark, manual commits became frequent
|
| This matches my experience: some shiny (even sometimes
| impressive) greenfield demos, but dramatically less useful for
| maintaining a codebase - which for any successful product is
| 90% of the work.
| Lerc wrote:
| > _Treat prompts as version-controlled assets. Including prompts
| in commit messages creates valuable context for future
| maintenance and debugging._
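|
| (For concreteness, the quoted practice looks roughly like
| this -- the message and prompt below are invented for
| illustration, not taken from the Cloudflare repo:)
|
|     Add token endpoint error handling
|
|     Prompt: "Return RFC 6749-compliant error responses
|     (invalid_grant, invalid_client) from the /token handler
|     instead of throwing, and add tests for each case."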
|
| I think this is valuable data, but it is also out of distribution
| data. Prior to AI models writing code, this wasn't present in
| the training set. Additional training will probably be needed
| to correlate better results with the new input stream, and
| also to learn that some of the records document its own
| unreliability and to develop a healthy scepticism of what it
| has said in the past.
|
| There's a lot of talk about model collapse with models training
| purely on their own output, or AI slop infecting training data
| sets, but ultimately it is all data. Combined with a signal to
| say which bits were ultimately beneficial, it can all be put to
| use. Even the failures can provide a good counterfactual signal
| for contrastive learning.
| _pdp_ wrote:
| I commented on the original discussion a few days ago but I will
| do it again.
|
| Why is this such a big deal? This library is not even that
| interesting. It is a very straightforward task I expect most
| programmers will be able to pull off easily. 2/3 of the code
| is type interfaces and comments. The rest is a by-the-book
| implementation of a protocol that is not even that complex.
|
| Please, there are some React JSX files in your code base with a
| lot more complexities and intricacies than this.
|
| Has anyone even read the code at all?
| axi0m wrote:
| >> what if we treated prompts as the actual source code?
|
| And they probably will be. It looks like prompts have become
| the new higher-level coding language, the same way JavaScript
| is a human-friendly abstraction over lower-level languages
| (like C), which are in turn a more accessible way to write
| assembly, and the same goes for the underlying binary code...
| I guess we have now reached the final step in the development
| chain, bridging the gap between hardware instructions and
| human language.
| starkparker wrote:
| > Almost every feature required multiple iterations and
| refinements. This isn't a limitation--it's how the collaboration
| works.
|
| I guess that's where a big miss in understanding so much of the
| messaging about generative AI in coding happens for me, and why
| the Fly.io skepticism blog post irritated me so much as well.
|
| It _is_ how collaboration with a person works, but when you
| have to fix the issues that the tool created, you aren't
| collaborating with a person; you're making up for a broken
| tool.
|
| I can't think of any field where I'd be expected to not only put
| up with, but also celebrate, a tool that screwed up and required
| manual intervention so often.
|
| The level of anthropomorphism that occurs in order to advocate on
| behalf of generative AI use leads to saying things like "it's how
| collaboration works" here, when I'd never say the same thing
| about the table saw in my woodshop, or even the relatively smart
| cruise control on my car.
|
| Generative AI is still just a tool built by people following a
| design, and which purportedly makes work easier. But when my saw
| tears out cuts that I have to then sand or recut, or when my car
| slams on the brakes because it can't understand a bend in the
| road around a parking lane, I don't shrug and ascribe them human
| traits and blame myself for being frustrated over how they
| collaborate with me.
___________________________________________________________________
(page generated 2025-06-07 23:01 UTC)