[HN Gopher] Test-driven development with an LLM for fun and profit
___________________________________________________________________
Test-driven development with an LLM for fun and profit
Author : crazylogger
Score : 116 points
Date : 2025-01-16 15:30 UTC (7 hours ago)
(HTM) web link (blog.yfzhou.fyi)
(TXT) w3m dump (blog.yfzhou.fyi)
| simonw wrote:
| Here's the Go app described in the post:
| https://github.com/yfzhou0904/tdd-with-llm-go
|
| Example usage from that README (and the blog post):
| % go run main.go \ --spec 'develop a function to take in a
| large text, recognize and parse any and all ipv4 and ipv6
| addresses and CIDRs contained within it (these may be surrounded
| by random words or symbols like commas), then return them as a
| list' \ --sig 'func ParseCidrs(input string) ([]*net.IPNet,
| error)'
|
| The all important prompts it uses are in
| https://github.com/yfzhou0904/tdd-with-llm-go/blob/main/prom...
| picografix wrote:
| very few times we are encountered with developing from scratch
| xianshou wrote:
| One trend I've noticed, framed as a logical deduction:
|
| 1. Coding assistants based on o1 and Sonnet are pretty great at
| coding with <50k context, but degrade rapidly beyond that.
|
| 2. Coding agents do massively better when they have a test-driven
| reward signal.
|
| 3. If a problem can be framed in a way that a coding agent can
| solve, that speeds up development at least 10x from the base case
| of human + assistant.
|
| 4. From (1)-(3), if you can get all the necessary context into
| 50k tokens and measure progress via tests, you can speed up
| development by 10x.
|
| 5. Therefore all new development should be microservices written
| from scratch and interacting via cleanly defined APIs.
|
| Sure enough, I see HN projects evolving in that direction.
| Arcuru wrote:
| > 5. Therefore all new development should be microservices
| written from scratch and interacting via cleanly defined APIs.
|
| Not necessarily. You can get the same benefits you described in
| (1)-(3) by using clearly defined modules in your codebase, they
| don't need to be separate microservices.
| sdesol wrote:
| Agreed. If the microservice does not provide any value from
| being isolated, it is just a function call with extra steps.
| __MatrixMan__ wrote:
| I think the argument is that the extra value provided is a
| small enough context window for working with an LLM.
| Although I'd suggest making it a library if one can manage,
| that gives you the desired context reduction bounded by
| interfaces without taking on the complexities of adding an
| additional microservice.
|
| I imagine throwing a test at an LLM and saying:
|
| > hold the component under test constant (as well as the
| test itself), and walk the versions of the library until
| you can tell me where they're compatible and where they
| break.
|
| If you tried to do that with a git bisect and everything in
| the same codebase, you'd end up varying all three (test,
| component, library) which is worse science than holding two
| constant and varying the third would be.
| whoisnnamdi wrote:
| This is a helpful breakdown of a trend, thank you
|
| Might be a boon for test-driven development. Could turn out
| that AI coding is the killer app for TDD. I had a similar
| thought about a year ago but had forgotten, appreciate the
| reminder
| Swizec wrote:
| > 5. Therefore all new development should be ~~microservices~~
| modules written from scratch and interacting via cleanly
| defined APIs.
|
| We figured this out for humans almost 20 years ago. Some really
| good empirical research. It's the only approach to large scale
| software development that works.
|
| But it requires leadership that gives a shit about the quality
| of their product and value long-term outcomes over short-term
| rewards.
| p1necone wrote:
| By large scale do you mean large software or large amounts of
| developers? Because there's some absolutely massive software
| in terms of feature set, usefulness and even LoC (not that
| that is a useful measurement) etc out there made by very
| small teams.
|
| I'm not sure that you've got the causal relationship the
| right way around here re: architecture:team size.
| Swizec wrote:
| What does team size have to do with this? Small teams can
| (and should) absolutely build modularized software ...
|
| You simply cannot build a [working/maintainable] large
| piece of software if everything is connected to everything
| and any one change may cause issues in conceptually
| unrelated pieces of code. As soon as your codebase is
| bigger than what you can fully memorize, you need modules,
| separation of concerns, etc.
| sdesol wrote:
| > you can speed up development by 10x.
|
| If you know what you are doing, then yes. If you are a domain
| expert and can articulate your thoughts clearly in a prompt,
| you will most likely see a boost--perhaps two to three times--
| but ten times is unlikely. And if you don't fully understand
| the problem, you may experience a negative effect.
| throwup238 wrote:
| I think it also depends on how much yak-shaving is involved
| in the domain, regardless of expertise. Whether that's
| something simple like remembering the right bash incantation
| or something more complex like learning enough Terraform and
| providers to be able to spin up cloud infrastructure.
|
| Some projects just have a lot of stuff to do around the edges
| and LLMs excel at that.
| swatcoder wrote:
| > 3. If a problem can be framed in a way that a coding agent
| can solve...
|
| This reminds me of the South Park underwear gnomes. You picked
| a tool and set an expectation, then just kind of hand wave over
| the hard part in the middle, as though framing problems "in a
| way coding agents can solve" is itself a well-understood or
| bounded problem.
|
| Does it sometimes take 50x effort to understand a problem and
| the agent well enough to get that done? Are there classes of
| problems where it can't be done? Are either of those concerns
| something you can recognize before they impact you? At
| commercial quality, is it an accessible skill for inexperienced
| people or do you need a mastery of coding, the problem domain,
| or the coding agent to be able to rely on it? Can teams recruit
| people who can reliable achieve any of this? How expensive is
| _that_ talent? etc
| emptiestplace wrote:
| We've had failed projects since long before LLMs. I think
| there is a tendency for people to gloss over this (3.)
| regardless, but working with an LLM it tends to become
| obvious much more quickly, without investing tens/hundreds of
| person-hours. I know it's not perfect, but I find a lot of
| the things people complain about would've been a problem
| either way - especially when people think they are going to
| go from 'hello world' to SaaS-billionaire in an hour.
|
| I think mastery of the problem domain is still important, and
| until we have effectively infinite context windows (that work
| perfectly), you will need to understand how and when to
| refactor to maximize quality and relevance of data in
| context.
| dingnuts wrote:
| well according to xianshou's profile they work in finance
| so it makes sense to me that they would gloss over the hard
| part of programming when describing how AI is going to
| improve it
| ziddoap wrote:
| Working in one domain does not preclude knowledge of
| others. I work in cybersec but spent my first working
| decade in construction estimation for institutional
| builds. I can talk confidently about firewalls or the
| hospital you want to build.
|
| No need to make assumptions based on a one-line hacker
| news profile.
| hitchstory wrote:
| >as though framing problems "in a way coding agents can
| solve" is itself a well-understood or bounded problem.
|
| It's not, but if you can A) make it cheap to try out
| different types of framings and some of them work and B)
| automate everything else then the labor intensity of
| programming decreases drastically.
|
| >At commercial quality, is it an accessible skill for
| inexperienced people
|
| I'd expect the opposite, it would be an extremely
| inaccessible skill requiring high skill and high pay. But, if
| 2 people can get as much done as 15 people at a higher
| quality and they're paid triple, it's still way cheaper
| overall.
|
| I would still expect somebody following this development
| pattern to routinely discover a problem the LLM can't deal
| with and have to dive under the hood to fix it. This would be
| _Hard_ with a capital H.
| steeeeeve wrote:
| So having clear requirements, a focused purpose for software,
| and a clear boundary of software responsibility makes for a
| software development task that can be accomplished?
|
| If only people had figured out at some point that the same
| thing applies when communicating to human software engineers.
| phaedrus wrote:
| 50K context is an interesting number because I think there's a
| lot to explore with software within an order of magnitude that
| size. With apologies to Richard Feynman, I call it, "There's
| plenty of room in the middle." My idea there is the rapid
| expansion of computing power during the reign of Moore's law
| left the design space of "medium sized" programs under-
| explored. These would be programs in the range of 100's of
| kilobytes to low megabytes.
| smusamashah wrote:
| On a similar note, has anyone found themselves absolutely not
| trusting non-code LLM output?
|
| The code is at least testable and verifiable. For everything else
| I am left wondering if it's the truth or a hallucination. It
| incurs more mental burden that I was trying to avoid using LLM in
| the first place.
| joshstrange wrote:
| Absolutely. LLMs are a "need to verify" the results almost
| always. LLMs (for me) shine by pointing me in the right
| direction, getting a "first draft", or for things like code
| where I can test it.
| nyrikki wrote:
| It is really the only safe way to use it IMHO.
|
| Even in most simple forms of automation, humans suffer from
| Automation Bias and Complacency and one of the better ways to
| avoid those issues is to instill a fundamental mistrust of
| those systems.
|
| IMHO it is important to look at other fields and the human
| factors studies to understand this.
|
| As an example ABS was originally sold as a technology that
| would help you 'stop faster'. Which it may do in some
| situations, and it is obviously mandatory in the US. But they
| had to shift how they 'sell' it now, to ensure that people
| didn't rely on it.
|
| https://www.fmcsa.dot.gov/sites/fmcsa.dot.gov/files/docs/200...
| 2.18 - Antilock Braking Systems (ABS) ABS is a
| computerized system that keeps your wheels from locking up
| during hard brake applications. ABS is an addition to
| your normal brakes. It does not decrease or increase your
| normal braking capability. ABS only activates when wheels are
| about to lock up. ABS does not necessarily shorten your
| stopping distance, but it does help you keep the vehicle under
| control during hard braking.
|
| Transformers will _always_ produce code that doesn 't work, it
| doesn't matter if that is due to what they call hallucinations,
| Rice's theory, etc...
|
| Maintaining that mistrust is the mark of someone who
| understands and can leverage the technology. It is just yet
| another context specific tradeoff analysis that we will need to
| assess.
|
| I think forcing people into the quasi-TDD thinking model, where
| they focus on what needs to be done first vs jumping into the
| implementation details will probably be a positive thing for
| the industry, no matter where on the spectrum LLM coding
| assistants arrive.
|
| That is one of the hardest things to teach when trying to
| introduce TDD, focusing on what is far closer to an ADT than
| implementation specific unit tests to begin with is very
| different but very useful.
|
| I am hopeful that required tacit experience will help get past
| the issues with formal frameworks that run into many barriers
| that block teaching that one skill.
|
| As LLM's failure mode is _Always Confident, Often Competent,
| and Inevitably Wrong_ , it is super critical to always realize
| the third option is likely and that you are the expert.
| sdesol wrote:
| > On a similar note, has anyone found themselves absolutely not
| trusting non-code LLM output?
|
| I'm working on a LLM chat app that is built around mistrust.
| The basic idea is that it is unlikely a supermajority of
| quality LLMs can get it wrong.
|
| This isn't foolproof though, but it does provide some level of
| confidence in the answer.
|
| Here is a quick example in which I analyze results from
| multiple LLMs that answered, "When did Homer Simpson go to
| Mars?"
|
| https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...
|
| If you look at the yes and no table, all except GPT-4o and
| GPT-4o mini said no. After asking GPT-4o who was correct, it
| provided "evidence" on an episode so I asked for more
| information on that episode. Based on what it said, it looks
| like the mission to Mars was a hoax and when I challenged
| GPT-4o on this, it agreed and said Homer never went to Mars,
| like others have said.
|
| I then asked Sonnet 3.5 about the episode and it said GPT-4o
| misinterpreted the plot.
|
| https://beta.gitsense.com/?chat=4d28f283-24f4-4657-89e0-5abf...
|
| At this point, I am confident (but not 100% sure) Homer never
| went to Mars and if I really needed to know, I'll need to
| search the web.
| manmal wrote:
| Relatedly, asking LLMs what happens in a TV episode, or a
| series in general, I usually get very low quality and mostly
| flat out wrong answers. That baffles me, as I thought there
| are multiple well structured synopses for any TV series in
| the training data.
| horsawlarway wrote:
| Isn't this essentially making the point of the post above
| you?
|
| For comparison - if I just do a web search for "Did homer
| simpson go to mars" I get immediately linked to the wikipedia
| page for that exact episode
| (https://en.wikipedia.org/wiki/The_Marge-ian_Chronicles), and
| the plot summary is less to read than your LLM output - It
| clearly summarizes that Marge & Lisa (note - NOT homer)
| almost went to mars, but did not go. Further - the summary
| correctly includes the outro which _does_ show Marge and Lisa
| on mars in the year 2051.
|
| Basically - for factual content, the LLM output was a garbage
| game of telephone.
| Marceltan wrote:
| Agree. My biggest pain point with LLM code review tools is that
| they sometimes add 40 comments for a PR changing 100 lines of
| code. Gets noisy and hard to decipher what really matters.
|
| Along the lines of verifiability, my take is that running a
| comprehensive suite of tests in CI/CD is going to be table
| stakes soon given that LLMs are only going to be contributing
| more and more code.
| redcobra762 wrote:
| You're going to fall behind eventually, if you continue to
| treat LLMs with this level of skepticism, as others won't, and
| the output is accurate enough that it can be useful to improve
| the efficiency of work in a great many situations.
|
| Rarely are day-to-day written documents (e.g. an email asking
| for clarification on an issue or to schedule an appointment) of
| such importance that the occasional error is unforgivable. In
| situations where a mistake _is_ fatal, yes I would not trust
| GenAI. But how many of us really work in that kind of a field?
|
| Besides, AI shines when used for creative purposes. Coming up
| with new ideas or rewording a paragraph for clarity isn't
| something one does blindly. GenAI is a coworker, not an
| authority. It'll generate a draft, I may edit that draft or
| rewrite it significantly, but to preclude it because it _could_
| error will eventually slow you down in your field.
| iamnotagenius wrote:
| Yes, it is good for suumarizing existing text, explaining
| something or coding; in short any generative/transformative
| tasks. Not good for information retrieval. Having said that
| even tiny Qwen 3b/7b coding llms turned out to be very useful
| in my use experience.
| energy123 wrote:
| We need a hallucination benchmark.
|
| My experience is, o1 is very good at avoiding hallucinations
| and I trust it more, but o1-mini and 4o is awful.
| blopker wrote:
| In Rust, there's a controversial practice around putting unit
| tests in the same file as the actual code. I was put off by it at
| first, but I'm finding LLM autocomplete is able to be much more
| effective just being able to see the tests.
|
| No clunky loop needed.
|
| It's gotten me back into TDD.
| sitkack wrote:
| If the LLM can't complete a task, you add a test the shows it
| how to do it. This is multishot incontext learning and
| programming by example.
|
| As for real TDD, you start with the tests and code until they
| pass. I haven't used an LLM to do this in Rust yet, but in
| Python due its dynamic nature, it is much simpler.
|
| You can write the tests, then have the LLM sketch the code out
| enough so that they pass or at least exist enough to pass a
| linter. Dev tools are going to feel like magic 18 months from
| now.
| eesmith wrote:
| > recognize and parse any and all ipv4 and ipv6 addresses and
| CIDRs contained within it (these may be surrounded by random
| words or symbols like commas), then return them as a list'
|
| Did I miss the generated code and test cases? I would like to see
| how complete it was.
|
| For example, for IPv4 does it only handle quad-dotted IP
| addresses, or does it also handle decimal and hex formats?
|
| For that matter, _should_ it handle those, and if so, where there
| clarification of what exactly 'all ipv4 ... addresses' means?
|
| I can think of a lot of tricky cases (like 1.2.3.4.5 and 3::2::1
| as invalid cases, or http://[2001:db8:4006:812::200e] to test for
| "symbols like commas"), and would like to see if the result
| handles them.
| agentultra wrote:
| This is not a good idea.
|
| If you want better tests with more cases exercising your code:
| write property based tests.
|
| Tests form an executable, informal specification of what your
| software is supposed to do. It should absolutely be written by
| hand, by a human, for other humans to use and understand. Natural
| language is not precise enough for even informal specifications
| of software modules, let alone software systems.
|
| If using LLM's to help you write the code is your jam, I can't
| stop you, but at least write the tests. They're more important.
|
| As an aside, I understand how this antipathy towards TDD
| develops. People write unit tests, after writing the
| implementation, because they see it as boilerplate code that
| mirrors what the code they're testing already does. They're
| missing the point of what makes a good test useful and
| sufficient. I would not expect generating more tests of this
| nature is going to improve software much.
|
| _Edit_ added some wording for clarity
| ozten wrote:
| I got massive productivity gains from having an LLM fill out my
| test suite.
|
| It is like autocomplete and macros... "Based on these two unit
| tests, fill out the suite considering b, c, and d. Add any
| critical corner case tests I have missed or suggest them if
| they don't fit well."
|
| It is on the human to look at the generated test to ensure a)
| they are comprehensive and b) useful and c) communicate clearly
| lifeisstillgood wrote:
| Can you extend that - what was the domain, how did you start?
| I would like to give this a try but am not quite sure I get
| it?
| ozten wrote:
| Backend coding for web services.
|
| In the past I would hand write 8 or 9 unit tests. Now I
| write the first one or two and then brain dump anything
| else into the LLM prompt. It then outputs mine plus 6 or
| more.
|
| I delete any that seem low value or ridiculous or have a
| follow up prompt to ask for refinements. Then just
| copy/pasta back into the codebase out of the chat.
| zephraph wrote:
| Hey, yeah, this is a fun idea. I built a little toy llm-tdd loop
| as a Saturday morning side project a little while back:
| https://github.com/zephraph/llm-tdd.
|
| This doesn't actually work out that well in practice though
| because the implementations the llm tended to generate were
| highly specific to pass the tests. There were several times it
| would cheat and just return hard coded strings that matched the
| expects of the tests. I'm sure better prompt engineering could
| help, but it was a fairly funny outcome.
|
| Something I've found more valuable is generating the tests
| themselves. Obviously you don't wholesale rely on what's
| generated. Tests can have a certain activation energy just to
| figure out how to set up correctly (especially if you're in a new
| project). Having an LLM take a first pass at it and then ensuring
| it's well structured and testing important codepaths instead of
| implementation details makes it a lot faster to write tests.
| mmikeff wrote:
| Writing a whole load of tests up front and then coding until all
| the tests pass is not TDD.
___________________________________________________________________
(page generated 2025-01-16 23:00 UTC)