[HN Gopher] Test-driven development with an LLM for fun and profit
       ___________________________________________________________________
        
       Test-driven development with an LLM for fun and profit
        
       Author : crazylogger
       Score  : 116 points
       Date   : 2025-01-16 15:30 UTC (7 hours ago)
        
 (HTM) web link (blog.yfzhou.fyi)
 (TXT) w3m dump (blog.yfzhou.fyi)
        
       | simonw wrote:
       | Here's the Go app described in the post:
       | https://github.com/yfzhou0904/tdd-with-llm-go
       | 
       | Example usage from that README (and the blog post):
       | % go run main.go \       --spec 'develop a function to take in a
       | large text, recognize and parse any and all ipv4 and ipv6
       | addresses and CIDRs contained within it (these may be surrounded
       | by random words or symbols like commas), then return them as a
       | list' \       --sig 'func ParseCidrs(input string) ([]*net.IPNet,
       | error)'
       | 
       | The all important prompts it uses are in
       | https://github.com/yfzhou0904/tdd-with-llm-go/blob/main/prom...
        
       | picografix wrote:
       | very few times we are encountered with developing from scratch
        
       | xianshou wrote:
       | One trend I've noticed, framed as a logical deduction:
       | 
       | 1. Coding assistants based on o1 and Sonnet are pretty great at
       | coding with <50k context, but degrade rapidly beyond that.
       | 
       | 2. Coding agents do massively better when they have a test-driven
       | reward signal.
       | 
       | 3. If a problem can be framed in a way that a coding agent can
       | solve, that speeds up development at least 10x from the base case
       | of human + assistant.
       | 
       | 4. From (1)-(3), if you can get all the necessary context into
       | 50k tokens and measure progress via tests, you can speed up
       | development by 10x.
       | 
       | 5. Therefore all new development should be microservices written
       | from scratch and interacting via cleanly defined APIs.
       | 
       | Sure enough, I see HN projects evolving in that direction.
        
         | Arcuru wrote:
         | > 5. Therefore all new development should be microservices
         | written from scratch and interacting via cleanly defined APIs.
         | 
         | Not necessarily. You can get the same benefits you described in
         | (1)-(3) by using clearly defined modules in your codebase, they
         | don't need to be separate microservices.
        
           | sdesol wrote:
           | Agreed. If the microservice does not provide any value from
           | being isolated, it is just a function call with extra steps.
        
             | __MatrixMan__ wrote:
             | I think the argument is that the extra value provided is a
             | small enough context window for working with an LLM.
             | Although I'd suggest making it a library if one can manage,
             | that gives you the desired context reduction bounded by
             | interfaces without taking on the complexities of adding an
             | additional microservice.
             | 
             | I imagine throwing a test at an LLM and saying:
             | 
             | > hold the component under test constant (as well as the
             | test itself), and walk the versions of the library until
             | you can tell me where they're compatible and where they
             | break.
             | 
             | If you tried to do that with a git bisect and everything in
             | the same codebase, you'd end up varying all three (test,
             | component, library) which is worse science than holding two
             | constant and varying the third would be.
        
         | whoisnnamdi wrote:
         | This is a helpful breakdown of a trend, thank you
         | 
         | Might be a boon for test-driven development. Could turn out
         | that AI coding is the killer app for TDD. I had a similar
         | thought about a year ago but had forgotten, appreciate the
         | reminder
        
         | Swizec wrote:
         | > 5. Therefore all new development should be ~~microservices~~
         | modules written from scratch and interacting via cleanly
         | defined APIs.
         | 
         | We figured this out for humans almost 20 years ago. Some really
         | good empirical research. It's the only approach to large scale
         | software development that works.
         | 
         | But it requires leadership that gives a shit about the quality
         | of their product and value long-term outcomes over short-term
         | rewards.
        
           | p1necone wrote:
           | By large scale do you mean large software or large amounts of
           | developers? Because there's some absolutely massive software
           | in terms of feature set, usefulness and even LoC (not that
           | that is a useful measurement) etc out there made by very
           | small teams.
           | 
           | I'm not sure that you've got the causal relationship the
           | right way around here re: architecture:team size.
        
             | Swizec wrote:
             | What does team size have to do with this? Small teams can
             | (and should) absolutely build modularized software ...
             | 
             | You simply cannot build a [working/maintainable] large
             | piece of software if everything is connected to everything
             | and any one change may cause issues in conceptually
             | unrelated pieces of code. As soon as your codebase is
             | bigger than what you can fully memorize, you need modules,
             | separation of concerns, etc.
        
         | sdesol wrote:
         | > you can speed up development by 10x.
         | 
         | If you know what you are doing, then yes. If you are a domain
         | expert and can articulate your thoughts clearly in a prompt,
         | you will most likely see a boost--perhaps two to three times--
         | but ten times is unlikely. And if you don't fully understand
         | the problem, you may experience a negative effect.
        
           | throwup238 wrote:
           | I think it also depends on how much yak-shaving is involved
           | in the domain, regardless of expertise. Whether that's
           | something simple like remembering the right bash incantation
           | or something more complex like learning enough Terraform and
           | providers to be able to spin up cloud infrastructure.
           | 
           | Some projects just have a lot of stuff to do around the edges
           | and LLMs excel at that.
        
         | swatcoder wrote:
         | > 3. If a problem can be framed in a way that a coding agent
         | can solve...
         | 
         | This reminds me of the South Park underwear gnomes. You picked
         | a tool and set an expectation, then just kind of hand wave over
         | the hard part in the middle, as though framing problems "in a
         | way coding agents can solve" is itself a well-understood or
         | bounded problem.
         | 
         | Does it sometimes take 50x effort to understand a problem and
         | the agent well enough to get that done? Are there classes of
         | problems where it can't be done? Are either of those concerns
         | something you can recognize before they impact you? At
         | commercial quality, is it an accessible skill for inexperienced
         | people or do you need a mastery of coding, the problem domain,
         | or the coding agent to be able to rely on it? Can teams recruit
         | people who can reliable achieve any of this? How expensive is
         | _that_ talent? etc
        
           | emptiestplace wrote:
           | We've had failed projects since long before LLMs. I think
           | there is a tendency for people to gloss over this (3.)
           | regardless, but working with an LLM it tends to become
           | obvious much more quickly, without investing tens/hundreds of
           | person-hours. I know it's not perfect, but I find a lot of
           | the things people complain about would've been a problem
           | either way - especially when people think they are going to
           | go from 'hello world' to SaaS-billionaire in an hour.
           | 
           | I think mastery of the problem domain is still important, and
           | until we have effectively infinite context windows (that work
           | perfectly), you will need to understand how and when to
           | refactor to maximize quality and relevance of data in
           | context.
        
             | dingnuts wrote:
             | well according to xianshou's profile they work in finance
             | so it makes sense to me that they would gloss over the hard
             | part of programming when describing how AI is going to
             | improve it
        
               | ziddoap wrote:
               | Working in one domain does not preclude knowledge of
               | others. I work in cybersec but spent my first working
               | decade in construction estimation for institutional
               | builds. I can talk confidently about firewalls or the
               | hospital you want to build.
               | 
               | No need to make assumptions based on a one-line hacker
               | news profile.
        
           | hitchstory wrote:
           | >as though framing problems "in a way coding agents can
           | solve" is itself a well-understood or bounded problem.
           | 
           | It's not, but if you can A) make it cheap to try out
           | different types of framings and some of them work and B)
           | automate everything else then the labor intensity of
           | programming decreases drastically.
           | 
           | >At commercial quality, is it an accessible skill for
           | inexperienced people
           | 
           | I'd expect the opposite, it would be an extremely
           | inaccessible skill requiring high skill and high pay. But, if
           | 2 people can get as much done as 15 people at a higher
           | quality and they're paid triple, it's still way cheaper
           | overall.
           | 
           | I would still expect somebody following this development
           | pattern to routinely discover a problem the LLM can't deal
           | with and have to dive under the hood to fix it. This would be
           | _Hard_ with a capital H.
        
         | steeeeeve wrote:
         | So having clear requirements, a focused purpose for software,
         | and a clear boundary of software responsibility makes for a
         | software development task that can be accomplished?
         | 
         | If only people had figured out at some point that the same
         | thing applies when communicating to human software engineers.
        
         | phaedrus wrote:
         | 50K context is an interesting number because I think there's a
         | lot to explore with software within an order of magnitude that
         | size. With apologies to Richard Feynman, I call it, "There's
         | plenty of room in the middle." My idea there is the rapid
         | expansion of computing power during the reign of Moore's law
         | left the design space of "medium sized" programs under-
         | explored. These would be programs in the range of 100's of
         | kilobytes to low megabytes.
        
       | smusamashah wrote:
       | On a similar note, has anyone found themselves absolutely not
       | trusting non-code LLM output?
       | 
       | The code is at least testable and verifiable. For everything else
       | I am left wondering if it's the truth or a hallucination. It
       | incurs more mental burden that I was trying to avoid using LLM in
       | the first place.
        
         | joshstrange wrote:
         | Absolutely. LLMs are a "need to verify" the results almost
         | always. LLMs (for me) shine by pointing me in the right
         | direction, getting a "first draft", or for things like code
         | where I can test it.
        
         | nyrikki wrote:
         | It is really the only safe way to use it IMHO.
         | 
         | Even in most simple forms of automation, humans suffer from
         | Automation Bias and Complacency and one of the better ways to
         | avoid those issues is to instill a fundamental mistrust of
         | those systems.
         | 
         | IMHO it is important to look at other fields and the human
         | factors studies to understand this.
         | 
         | As an example ABS was originally sold as a technology that
         | would help you 'stop faster'. Which it may do in some
         | situations, and it is obviously mandatory in the US. But they
         | had to shift how they 'sell' it now, to ensure that people
         | didn't rely on it.
         | 
         | https://www.fmcsa.dot.gov/sites/fmcsa.dot.gov/files/docs/200...
         | 2.18 - Antilock Braking Systems (ABS)              ABS is a
         | computerized system that keeps your wheels from locking up
         | during hard brake applications.         ABS is an addition to
         | your normal brakes. It does not decrease or increase your
         | normal braking capability. ABS only activates when wheels are
         | about to lock up.         ABS does not necessarily shorten your
         | stopping distance, but it does help you keep the vehicle under
         | control during hard braking.
         | 
         | Transformers will _always_ produce code that doesn 't work, it
         | doesn't matter if that is due to what they call hallucinations,
         | Rice's theory, etc...
         | 
         | Maintaining that mistrust is the mark of someone who
         | understands and can leverage the technology. It is just yet
         | another context specific tradeoff analysis that we will need to
         | assess.
         | 
         | I think forcing people into the quasi-TDD thinking model, where
         | they focus on what needs to be done first vs jumping into the
         | implementation details will probably be a positive thing for
         | the industry, no matter where on the spectrum LLM coding
         | assistants arrive.
         | 
         | That is one of the hardest things to teach when trying to
         | introduce TDD, focusing on what is far closer to an ADT than
         | implementation specific unit tests to begin with is very
         | different but very useful.
         | 
         | I am hopeful that required tacit experience will help get past
         | the issues with formal frameworks that run into many barriers
         | that block teaching that one skill.
         | 
         | As LLM's failure mode is _Always Confident, Often Competent,
         | and Inevitably Wrong_ , it is super critical to always realize
         | the third option is likely and that you are the expert.
        
         | sdesol wrote:
         | > On a similar note, has anyone found themselves absolutely not
         | trusting non-code LLM output?
         | 
         | I'm working on a LLM chat app that is built around mistrust.
         | The basic idea is that it is unlikely a supermajority of
         | quality LLMs can get it wrong.
         | 
         | This isn't foolproof though, but it does provide some level of
         | confidence in the answer.
         | 
         | Here is a quick example in which I analyze results from
         | multiple LLMs that answered, "When did Homer Simpson go to
         | Mars?"
         | 
         | https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...
         | 
         | If you look at the yes and no table, all except GPT-4o and
         | GPT-4o mini said no. After asking GPT-4o who was correct, it
         | provided "evidence" on an episode so I asked for more
         | information on that episode. Based on what it said, it looks
         | like the mission to Mars was a hoax and when I challenged
         | GPT-4o on this, it agreed and said Homer never went to Mars,
         | like others have said.
         | 
         | I then asked Sonnet 3.5 about the episode and it said GPT-4o
         | misinterpreted the plot.
         | 
         | https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...
         | 
         | At this point, I am confident (but not 100% sure) Homer never
         | went to Mars and if I really needed to know, I'll need to
         | search the web.
        
           | manmal wrote:
           | Relatedly, asking LLMs what happens in a TV episode, or a
           | series in general, I usually get very low quality and mostly
           | flat out wrong answers. That baffles me, as I thought there
           | are multiple well structured synopses for any TV series in
           | the training data.
        
           | horsawlarway wrote:
           | Isn't this essentially making the point of the post above
           | you?
           | 
           | For comparison - if I just do a web search for "Did homer
           | simpson go to mars" I get immediately linked to the wikipedia
           | page for that exact episode
           | (https://en.wikipedia.org/wiki/The_Marge-ian_Chronicles), and
           | the plot summary is less to read than your LLM output - It
           | clearly summarizes that Marge & Lisa (note - NOT homer)
           | almost went to mars, but did not go. Further - the summary
           | correctly includes the outro which _does_ show Marge and Lisa
           | on mars in the year 2051.
           | 
           | Basically - for factual content, the LLM output was a garbage
           | game of telephone.
        
         | Marceltan wrote:
         | Agree. My biggest pain point with LLM code review tools is that
         | they sometimes add 40 comments for a PR changing 100 lines of
         | code. Gets noisy and hard to decipher what really matters.
         | 
         | Along the lines of verifiability, my take is that running a
         | comprehensive suite of tests in CI/CD is going to be table
         | stakes soon given that LLMs are only going to be contributing
         | more and more code.
        
         | redcobra762 wrote:
         | You're going to fall behind eventually, if you continue to
         | treat LLMs with this level of skepticism, as others won't, and
         | the output is accurate enough that it can be useful to improve
         | the efficiency of work in a great many situations.
         | 
         | Rarely are day-to-day written documents (e.g. an email asking
         | for clarification on an issue or to schedule an appointment) of
         | such importance that the occasional error is unforgivable. In
         | situations where a mistake _is_ fatal, yes I would not trust
         | GenAI. But how many of us really work in that kind of a field?
         | 
         | Besides, AI shines when used for creative purposes. Coming up
         | with new ideas or rewording a paragraph for clarity isn't
         | something one does blindly. GenAI is a coworker, not an
         | authority. It'll generate a draft, I may edit that draft or
         | rewrite it significantly, but to preclude it because it _could_
         | error will eventually slow you down in your field.
        
         | iamnotagenius wrote:
         | Yes, it is good for suumarizing existing text, explaining
         | something or coding; in short any generative/transformative
         | tasks. Not good for information retrieval. Having said that
         | even tiny Qwen 3b/7b coding llms turned out to be very useful
         | in my use experience.
        
         | energy123 wrote:
         | We need a hallucination benchmark.
         | 
         | My experience is, o1 is very good at avoiding hallucinations
         | and I trust it more, but o1-mini and 4o is awful.
        
       | blopker wrote:
       | In Rust, there's a controversial practice around putting unit
       | tests in the same file as the actual code. I was put off by it at
       | first, but I'm finding LLM autocomplete is able to be much more
       | effective just being able to see the tests.
       | 
       | No clunky loop needed.
       | 
       | It's gotten me back into TDD.
        
         | sitkack wrote:
         | If the LLM can't complete a task, you add a test the shows it
         | how to do it. This is multishot incontext learning and
         | programming by example.
         | 
         | As for real TDD, you start with the tests and code until they
         | pass. I haven't used an LLM to do this in Rust yet, but in
         | Python due its dynamic nature, it is much simpler.
         | 
         | You can write the tests, then have the LLM sketch the code out
         | enough so that they pass or at least exist enough to pass a
         | linter. Dev tools are going to feel like magic 18 months from
         | now.
        
       | eesmith wrote:
       | > recognize and parse any and all ipv4 and ipv6 addresses and
       | CIDRs contained within it (these may be surrounded by random
       | words or symbols like commas), then return them as a list'
       | 
       | Did I miss the generated code and test cases? I would like to see
       | how complete it was.
       | 
       | For example, for IPv4 does it only handle quad-dotted IP
       | addresses, or does it also handle decimal and hex formats?
       | 
       | For that matter, _should_ it handle those, and if so, where there
       | clarification of what exactly  'all ipv4 ... addresses' means?
       | 
       | I can think of a lot of tricky cases (like 1.2.3.4.5 and 3::2::1
       | as invalid cases, or http://[2001:db8:4006:812::200e] to test for
       | "symbols like commas"), and would like to see if the result
       | handles them.
        
       | agentultra wrote:
       | This is not a good idea.
       | 
       | If you want better tests with more cases exercising your code:
       | write property based tests.
       | 
       | Tests form an executable, informal specification of what your
       | software is supposed to do. It should absolutely be written by
       | hand, by a human, for other humans to use and understand. Natural
       | language is not precise enough for even informal specifications
       | of software modules, let alone software systems.
       | 
       | If using LLM's to help you write the code is your jam, I can't
       | stop you, but at least write the tests. They're more important.
       | 
       | As an aside, I understand how this antipathy towards TDD
       | develops. People write unit tests, after writing the
       | implementation, because they see it as boilerplate code that
       | mirrors what the code they're testing already does. They're
       | missing the point of what makes a good test useful and
       | sufficient. I would not expect generating more tests of this
       | nature is going to improve software much.
       | 
       |  _Edit_ added some wording for clarity
        
         | ozten wrote:
         | I got massive productivity gains from having an LLM fill out my
         | test suite.
         | 
         | It is like autocomplete and macros... "Based on these two unit
         | tests, fill out the suite considering b, c, and d. Add any
         | critical corner case tests I have missed or suggest them if
         | they don't fit well."
         | 
         | It is on the human to look at the generated test to ensure a)
         | they are comprehensive and b) useful and c) communicate clearly
        
           | lifeisstillgood wrote:
           | Can you extend that - what was the domain, how did you start?
           | I would like to give this a try but am not quite sure I get
           | it?
        
             | ozten wrote:
             | Backend coding for web services.
             | 
             | In the past I would hand write 8 or 9 unit tests. Now I
             | write the first one or two and then brain dump anything
             | else into the LLM prompt. It then outputs mine plus 6 or
             | more.
             | 
             | I delete any that seem low value or ridiculous or have a
             | follow up prompt to ask for refinements. Then just
             | copy/pasta back into the codebase out of the chat.
        
       | zephraph wrote:
       | Hey, yeah, this is a fun idea. I built a little toy llm-tdd loop
       | as a Saturday morning side project a little while back:
       | https://github.com/zephraph/llm-tdd.
       | 
       | This doesn't actually work out that well in practice though
       | because the implementations the llm tended to generate were
       | highly specific to pass the tests. There were several times it
       | would cheat and just return hard coded strings that matched the
       | expects of the tests. I'm sure better prompt engineering could
       | help, but it was a fairly funny outcome.
       | 
       | Something I've found more valuable is generating the tests
       | themselves. Obviously you don't wholesale rely on what's
       | generated. Tests can have a certain activation energy just to
       | figure out how to set up correctly (especially if you're in a new
       | project). Having an LLM take a first pass at it and then ensuring
       | it's well structured and testing important codepaths instead of
       | implementation details makes it a lot faster to write tests.
        
       | mmikeff wrote:
       | Writing a whole load of tests up front and then coding until all
       | the tests pass is not TDD.
        
       ___________________________________________________________________
       (page generated 2025-01-16 23:00 UTC)