[HN Gopher] Semantic unit testing: test code without executing it
___________________________________________________________________
Semantic unit testing: test code without executing it
Author : alexmolas
Score : 70 points
Date : 2025-05-03 09:44 UTC (2 days ago)
(HTM) web link (www.alexmolas.com)
(TXT) w3m dump (www.alexmolas.com)
| cjfd wrote:
| Much better solution: don't write useless docstrings.
| motorest wrote:
| > Much better solution: don't write useless docstrings.
|
| Actually writing the tests is far more effective, and doesn't
| require fancy frameworks tightly coupled with external
| services.
| masklinn wrote:
| Importantly there's all sorts of tests beyond trivial single-
| value unit tests. Property testing (via hypothesis, in
| python) for instance.
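|
| e.g. a minimal hypothesis sketch against the article's multiply
| example (the module and function names here are assumptions):
|
|     from hypothesis import given, strategies as st
|
|     from mymodule import multiply  # hypothetical function under test
|
|     # Property: multiplying by 1 is the identity for positive ints.
|     @given(st.integers(min_value=1))
|     def test_multiply_identity(n):
|         assert multiply(n, 1) == n
|
|     # Property: multiplication of positive ints is commutative.
|     @given(st.integers(min_value=1), st.integers(min_value=1))
|     def test_multiply_commutative(a, b):
|         assert multiply(a, b) == multiply(b, a)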
| gnabgib wrote:
| This seems to be your site @op.. your CSS needs attention. On a
| narrower screen (ie. portrait) the text is enormous, and worse,
| zooming out shrinks the quantity of words (increases the font-
| size).. which is surely the opposite of what's expected? It's
| basically unusable.
|
| Your CSS seems to assume all portrait screens (whether 80" or 3")
| deserve the same treatment.
| stephantul wrote:
| This is cool! I think that, in general, generating test cases
| "offline" using an LLM and then running them using regular unit
| testing also solves this particular issue.
|
| It also might be more transparent and cheaper.
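|
| A rough sketch of that offline flow (assuming the OpenAI Python
| SDK here, but any client would do; the module, function and paths
| are made up):
|
|     import inspect
|     from pathlib import Path
|
|     from openai import OpenAI
|
|     from mymodule import multiply  # hypothetical function under test
|
|     client = OpenAI()
|     PROMPT = ("Write a pytest file for this function. Use only its "
|               "docstring as the spec and cover edge cases:\n\n{src}")
|
|     def generate_tests(func, out_dir: str = "tests") -> Path:
|         """One-off generation: review the output, commit it, and from
|         then on it runs as ordinary, deterministic pytest code."""
|         src = inspect.getsource(func)
|         resp = client.chat.completions.create(
|             model="gpt-4o-mini",
|             messages=[{"role": "user",
|                        "content": PROMPT.format(src=src)}],
|         )
|         out = Path(out_dir) / f"test_{func.__name__}_generated.py"
|         out.write_text(resp.choices[0].message.content)
|         return out
|
|     generate_tests(multiply)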
| simianwords wrote:
| I was a bit skeptical at first but I think this is a good idea.
| Although I'm not convinced by the usage of the max_depth parameter.
| In real life you rarely know what type your dependencies are if
| they are loaded at run time. This is kind of why we explicitly
| mock our dependencies.
|
| On a side note: I have wondered whether LLMs are particularly
| good with functional languages. Imagine if your code entirely
| consisted of just pure functions and no side effects. You pass
| all parameters required and do not use static methods/variables
| and no OOP concepts like inheritance. I imagine every program
| could be converted into such a form, the tradeoff being human
| readability.
| jonathanlydall wrote:
| If you're stuck with dynamically typed languages, then tests like
| this can make a lot of sense.
|
| On statically typed languages this happens for free at compile
| time.
|
| I've often heard proponents of dynamically typed languages say
| how all the typing and boilerplate required by statically typed
| languages feels like such a waste of time, and on a small enough
| system maybe they are right.
|
| But on any significant sized code bases, they pay dividends over
| and over by saving you from having to make tests like this.
|
| They also allow trivial refactoring that people using dynamically
| typed languages wouldn't even consider due to the risk being so
| high.
|
| So keep this all in mind when you next choose your language for a
| new project.
| ngruhn wrote:
| I think at least some people who say this think of Java-esque
| type systems. And there I agree: it is a boilerplate nightmare.
| motorest wrote:
| > But on any significant sized code bases, they pay dividends
| over and over by saving you from having to make tests like
| this.
|
| I firmly believe that the group of people who laud dynamically
| typed languages as efficient time-savers, that help shed drudge
| work involving typing, is tightly correlated with the group of
| people who fail to establish any form of quality assurance or
| testing, often using the same arguments to justify their
| motivation.
| 0xDEAFBEAD wrote:
| The question I find interesting is whether type systems are
| an efficient way to buy reliability relative to other ways to
| purchase reliability, such as writing tests, doing code
| review, or enforcing immutability.
|
| Of course, some programmers just don't care about purchasing
| reliability. Those are the ones who eschew type systems, and
| tests, and produce unreliable software, about like you'd
| expect. But for my purposes, this is beside the point.
| bluGill wrote:
| I find they are valuable. When you have a small program - say
| 10k lines of code - you don't really need them. However when
| you are at more than 10 million lines of code types find a
| lot of little errors that writing the correct test for
| would be hard.
|
| Most dynamically typed languages (all that I have worked
| with) cannot catch that you misspelled a function name
| until that function is called. If that misspelled function
| is in an error path it would be very easy to never test it
| until a customer hit the crash. Just having your function
| names as a strong type that is checked by static analysis
| (need not be a compiler though that is what everything
| uses) is a big win. Checking the other arguments as well is
| similarly helpful.
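|
| A toy illustration (names made up, mypy as the checker) of the
| kind of error-path typo that static analysis flags before any
| test ever reaches it:
|
|     class Logger:
|         def warn(self, msg: str) -> None:
|             print(msg)
|
|     def parse_port(raw: str, log: Logger) -> int:
|         try:
|             return int(raw)
|         except ValueError:
|             # Typo in the rarely-exercised error path. mypy reports
|             # '"Logger" has no attribute "wanr"' at check time; a
|             # dynamic language only fails once this branch runs.
|             log.wanr(f"bad port {raw!r}, falling back to 8080")
|             return 8080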
| globular-toast wrote:
| Rubbish, in my experience. People who understand dynamic
| languages know they need to write tests because it's the
| _only_ thing asserting correctness. I could just as easily
| say static people don't write tests because they think the
| type system is enough. A type system is laughably bad at
| asserting correct behaviour.
|
| Personally I do use type hinting and mypy for much of my
| Python code. But I'll most certainly omit it for throwaway
| scripts and trivial stuff. I'm still not convinced it's
| really worth the effort, though. I've had a few occasions
| where the type checker has caught something important, but
| most of the time it's an autist trap where you spend ages
| making it correct "just because".
| motorest wrote:
| > Rubbish, in my experience. People who understand dynamic
| languages know they need to write tests because it's the
| only thing asserting correctness.
|
| Tests don't assert correctness. At best they verify
| specific invariants.
|
| Statically typed languages lean on the compiler to
| automatically verify some classes of invariants (e.g., can
| I call this method on this object?).
|
| With dynamically typed languages, you cannot lean on the
| compiler to verify these invariants. Developers must fill
| in this void by writing their own tests.
|
| It's true that they "need" to do it to avoid some classes
| of runtime errors that are only possible in dynamically
| typed languages. But that's not the point. The point is
| that those who complain that statically typed languages are
| too cumbersome because they require boilerplate code for
| things like compile-time type checking are also correlated
| with the set of developers who fail to invest any time
| adding or maintaining automated test suites, because of the
| same reasons.
|
| > I could just as easily say static people don't write
| tests because they think the type system is enough. A type
| system is laughably bad at asserting correct behaviour.
|
| No, you can't. Developers who use statically typed
| languages don't even think of type checking as a concern,
| let alone a quality assurance issue.
| bluGill wrote:
| > Tests don't assert correctness. At best they verify
| specific invariants.
|
| Pedantically correct, but in practice those are close
| enough to the same thing.
|
| Even a formal proof cannot assert correctness -
| requirements are often wrong. However in practice
| requirements are close enough to correct that we can call
| a formal proof also close enough.
| 0xDEAFBEAD wrote:
| Dan Luu looked at the literature and concluded that the
| evidence for the benefit of types is underwhelming:
|
| https://danluu.com/empirical-pl/
|
| >But on any significant sized code bases, they pay dividends
| over and over by saving you from having to make tests like
| this.
|
| OK, but if the alternative to tests is spending _more_ time on
| a reliability method (type annotations) which buys you _less_
| reliability compared to writing tests... it's hardly a win.
|
| It fundamentally seems to me that there are plenty of bugs that
| types can simply never catch. For example, if I have a "divide"
| function and I accidentally swap the numerator and divisor
| arguments, I can't think of any realistic type system which
| will help me. Other methods for achieving reliability, like
| writing tests or doing code review, don't seem to have the same
| limitations.
| Smaug123 wrote:
| > swap the numerator and divisor
|
| Even Rust can express this; you don't need to get fancy.
| Morally speaking, division takes a Num and a
| std::num::NonZero<Num>.
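|
| A rough Python analogue with NewType (names made up; mypy catches
| the swap, though unlike NonZero the "non-zero" part would stay a
| runtime check):
|
|     from typing import NewType
|
|     Numerator = NewType("Numerator", float)
|     Denominator = NewType("Denominator", float)
|
|     def divide(n: Numerator, d: Denominator) -> float:
|         return n / d
|
|     divide(Numerator(90.0), Denominator(2.0))  # fine
|     divide(Denominator(2.0), Numerator(90.0))  # rejected by mypy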
| UncleEntity wrote:
| > On statically typed languages this happens for free at
| compile time.
|
| If only that were true, I wouldn't be anywhere near as good at
| tracking down segfaults as I've become over the years...
| yuliyp wrote:
| Did the author do any analysis of the effectiveness of their tool
| on something beyond multiplication? Did they look to see if it
| caught any bugs in any codebases? What's the false positive rate?
| False negative?
|
| As is, it's neat that they wrote some code to generate some
| prompts for an LLM, but there's no indication that it actually
| works.
| motorest wrote:
| > Did the author do any analysis of the effectiveness of their
| tool on something beyond multiplication? Did they look to see
| if it caught any bugs in any codebases? What's the false
| positive rate? False negative?
|
| I would also add the concern about whether the tests are
| actually deterministic.
|
| The premise is also dubious, as docstring comments typically
| hold only very high-level descriptions of the implementation
| and often aren't even maintained. Writing a specification of
| what a function is expected to do is what writing tests is all
| about, and with LLMs these are a terse prompt away.
| bluGill wrote:
| Documentation should not be telling you how it is
| implemented. It should tell you how and why to use the
| function. Users who care about how it is implemented should
| be reading the code not the comments. Users who need to
| find/use a helper and get on with their feature shouldn't.
| rollulus wrote:
| I wonder if the random component of the LLM makes every test
| flaky by definition.
| dragonwriter wrote:
| This is more of "LLM code review" than any kind of testing, and
| calling it "testing" is just badly misleading.
| anself wrote:
| Agree, it's not testing. The problem is here: "In a typical
| testing workflow, you write some basic tests to check the core
| functionality. When a bug inevitably shows up--usually after
| deployment--you go back and add more tests to cover it. This
| process is reactive, time-consuming, and frankly, a bit
| tedious."
|
| This is exactly the problem that TDD solves. One of the most
| compelling reasons for test-first is because "Running the code
| in your head" does not actually work well in practice, leading
| to the above-cited issues. This is just another variant of
| "Running the code in your head" except an LLM is doing it.
| Strong TDD practices (don't write any code without a test to
| support it) will close those gaps. It may feel tedious at first
| but the safety it creates will leave you never wanting to go
| back.
|
| Where this could be safe and useful: Find gaps in the test-set.
| Places where the code was never written because there wasn't a
| test to drive it out. This is one of the hardest parts of TDD,
| and where LLMs could really help.
| IshKebab wrote:
| Yeah this sounds like a good way to detect out of date
| comments. I would have focused on that.
| spiddy wrote:
| This. Let's not confuse meanings. There are multiple ways to
| improve the quality of code. Testing is one, code review is
| another. This belongs to the latter.
| noodletheworld wrote:
| I don't think this is particularly terrible.
|
| Broadly speaking, linters are good, and if you have a way of
| linting implementation errors it's probably helpful.
|
| I would say it's probably more helpful while you're coding than
| at test/CI time because it will be, indubitably, flakey.
|
| However, for a local developer workflow I can see a reasonable
| value in being able to go:
|
| Take every function in my code and scan it to figure out if you
| think it's implemented correctly, and let me know if you spot
| anything that looks weird / wrong / broken. Ideally only
| functions that I've touched in my branch.
|
| So... you know. Cool idea. I think it's overselling how useful it
| is, but hey, smash your AI into every possible thing and
| eventually you'll find a few modestly interesting uses for it.
|
| This is probably a modestly interesting use case.
|
| > suite allows you to run the tests asynchronously, and since the
| main bottleneck is IO (all the computations happen in a GPU in
| the cloud) it means that you can run your tests very fast. This
| is a huge advantage in comparison to standard tests, which need
| to be run sequentially.
|
| uh... that said, saying that it's _fast_ to run your functions
| through an LLM compared to, you know, just running tests, is a
| little bit strange.
|
| I'm certain your laptop will melt if you run 500 functions in
| parallel through ollama gemma-3.
|
| Running it over a network is, obviously, similarly insane.
|
| This would also be enormously time-consuming and expensive to
| use with a hosted LLM API.
|
| The 'happy path' is probably having a plugin in your IDE that
| scans the files you touch and then runs this in the background
| when you make a commit, somehow using a local LLM of sufficient
| complexity that it can be useful (gemma3 would probably work).
|
| Kind of like having your tests in 'watch mode'; you don't expect
| instant feedback, but some time after you've done something you
| get a popup saying 'oh hey, are you sure you meant to return a
| string here..?'
|
| Maybe it would just be annoying. You'd have to build it out
| properly and see. /shrug
|
| I think it's not implausible though, that you could see something
| _vaguely like this_ that was generally useful.
|
| Probably what you see in this specific implementation is only the
| precursory contemplations of something actually useful though.
| Not really useful on its own, in its current form, imo.
| RainyDayTmrw wrote:
| I'm skeptical. Most of us maintaining medium sized codebases or
| larger are constantly fighting nondeterminism in the form of
| flaky tests. I can't imagine choosing a design that starts with
| nondeterminism baked in.
|
| And if you're really dead-set on paying nondeterminism to get
| more coverage, property-based testing has existed for a long time
| and has a comparatively solid track record.
| mrkeen wrote:
| Couldn't put it better myself.
|
| I have the toughest time trying to communicate why f(x) should
| equal f(x) in the general case.
| Garlef wrote:
| Hm... I think you have a good point.
|
| Maybe the non-determinism can be reduced by caching: Just
| reevaluate the spec if the code actually changes?
|
| I think there are also other problems (inlining a verbal
| description makes the codebase verbose, writing a precise, non-
| ambiguous verbal description might be more work than writing
| unit tests).
| carlmr wrote:
| >Maybe the non-determinism can be reduced by caching: Just
| reevaluate the spec if the code actually changes?
|
| That would be good anyway to keep the costs reasonable.
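|
| Something like this could serve as the cache layer (a rough
| sketch; llm_review is a stand-in for whatever call the suite
| makes, assumed to return something JSON-serializable):
|
|     import hashlib
|     import inspect
|     import json
|     from pathlib import Path
|
|     CACHE = Path(".semantic_test_cache.json")
|
|     def cached_semantic_check(func, llm_review):
|         """Only ask the LLM again if the function's source changed."""
|         src = inspect.getsource(func)
|         key = hashlib.sha256(src.encode()).hexdigest()
|         cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
|         if key not in cache:
|             # The expensive, non-deterministic call happens at most
|             # once per version of the function's source.
|             cache[key] = llm_review(src)
|             CACHE.write_text(json.dumps(cache))
|         return cache[key]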
| IshKebab wrote:
| I agree. I want this as a code review tool to check if people
| forgot to update comments - "it looks like this now adds
| instead of multiplies, but the comment says otherwise; did you
| forget to update it?".
|
| Seems of dubious value as unit tests. LLMs don't seem to be
| quite smart enough for that in my experience, unless your bugs
| are _really_ as trivial as adding instead of multiplying, in
| which case god help you.
| Davidbrcz wrote:
| Many good and prolific approaches are non-deterministic, such as
| fuzzing or property-based testing.
| masklinn wrote:
| > But here's the catch: you're missing some edge cases. What
| about negative inputs?
|
| The docstring literally says it only works with positive
| integers, and the LLM is supposed to follow the docstring (per
| previous assertions).
|
| > The problem is that traditional tests can only cover a narrow
| slice of your function's behavior.
|
| Property tests? Fuzzers? Symbolic execution?
|
| > Just because a high percentage of tests pass doesn't mean your
| code is bug-free.
|
| Neither does this thing. If you want your code to be bug-free,
| what you're looking for is a proof assistant, not vibe-reviewing.
|
| Also
|
| > One of the reasons to use suite is its seamless integration
| with pytest.
|
| Exposing a predicate is not "seamless integration with pytest",
| it's just exposing a predicate.
| cerpins wrote:
| It sounds like it might be a good use case for testing
| documentation - verifying whether what the documentation
| describes is actually in accordance with the code, and then you
| can act on it.
| With that in mind, it's also probably pointless to re-run if
| relevant code or documentation hasn't changed.
| vouwfietsman wrote:
| Maybe someone can help me out here:
|
| I always get the feeling that fundamentally our software should
| be built on a foundation of sound logic and reasoning. That
| doesn't mean that we cannot use LLMs to build that software, but
| it does mean that in the end every line of code must be validated
| to make sure there are no issues injected by the LLM tools that
| inherently lack logic and reasoning, or at least such validation
| must be on par with human authored code + review. Because of
| this, the validation cannot be done by an LLM, as it would just
| compound the problem.
|
| Unless we get a drastic change in the level of error detection
| and self-validation that can be done by an LLM, this remains a
| problem for the foreseeable future.
|
| How is it then that people build tooling where the LLM validates
| the code they write? Or claim 2x speedups for code written by
| LLMs? Is there some kind of false positive/negative tradeoff I'm
| missing that allows people to extract robust software from an
| inherently not-robust generation process?
|
| I'm not talking about search and documentation, where I'm already
| seeing a lot of benefit from LLMs today, because between the LLM
| output and the code is me, sanity checking and filtering
| everything. What I'm asking about is the: "LLM take the wheel!"
| type engineering.
| darawk wrote:
| This particular person seems to be using LLMs for code review,
| not generation. I agree that the problem is compounded if you
| use an LLM (esp. the same model) on both sides. However, it
| seems reasonable and useful to use it as an _adjunct_ to other
| forms of testing, though not necessarily a replacement for
| them. Though again, the degree to which it can be a replacement
| is a function of the level of the technology, and it is
| currently at the level where it can probably replace _some_
| traditional testing methods, though it's hard to know which,
| ex-ante.
|
| edit: of course, maybe that means we need a meta-suite, that
| uses a different LLM to tell you which tests you should write
| yourself and which tests you can safely leave to LLM review.
| vouwfietsman wrote:
| Indeed the idea of a meta LLM, or some sort of clear
| distinction between manual and automated-but-questionable
| tests makes sense. So what bothers me is that this does not seem
| to be the approach most people take: code produced by the LLM
| is treated the same as code produced by human authors.
| motorest wrote:
| > That doesn't mean that we cannot use LLMs to build that
| software, but it does mean that in the end every line of code
| must be validated to make sure there are no issues injected by
| the LLM tools that inherently (...)
|
| The problem with your assertion is that it fails to understand
| that today's software, where every single line of code was
| typed in by real flesh-and-bone humans, already fails to have
| adequate test coverage, let alone be validated.
|
| The main problem with output from LLMs is that they were
| trained with the code written by humans, and thus they
| accurately reflect the quality of the code that's found in the
| wild. Consequently, your line of reasoning actually criticizes
| LLMs for outputting the same unreliable code that people write.
|
| Counterintuitively, LLMs end up generating a better output
| because at least they are designed to simplify the task of
| automatically generating tests.
| vouwfietsman wrote:
| Right but by your reasoning it would make sense to use LLMs
| only to augment an incomplete but rigorous testing process,
| or to otherwise elevate below average code.
|
| My issue is not necessarily with the quality of the code, but
| rather with the intention of the code, which is much more
| important: a good design without tests is more durable than a
| bad design with tests.
| motorest wrote:
| > Right but by your reasoning it would make sense to use
| LLMs only to augment an incomplete but rigorous testing
| process, or to otherwise elevate below average code.
|
| No. It makes sense to use LLMs to generate tests. Even if
| their output matches the worst output the average human can
| write by hand, having any coverage whatsoever already
| raises the bar from where the average human output is.
|
| > My issue is not necessarily with the quality of the code,
| but rather with the intention of the code (...)
|
| That's not the LLM's responsibility. Humans specify what
| they want and LLMs fill in the blanks. If today's LLMs
| output bad results, that's a reflection of the prompts.
| Garbage in, garbage out.
| vouwfietsman wrote:
| > No. It makes sense to use LLMs to generate tests. Even
| if their output matches the worst output the average
| human can write by hand, having any coverage whatsoever
| already raises the bar from where the average human
| output is.
|
| Although this is true, it disregards the fact that
| prompting for tests takes time which may also be spent
| writing tests, and it's not clear if poor-quality tests
| are free, in the sense that further development may cause
| these tests to fail for the wrong reasons, causing time
| spent debugging. This is why I used the word "augment":
| these tests are clearly not the same quality as manual
| tests, and should be considered separately from manual
| tests. In other words, they may serve to elevate below
| average code or augment manual tests, but not more than
| that. Again, I'm not saying it makes no sense to do this.
|
| > That's not the LLM's responsibility. Humans specify
| what they want and LLMs fill in the blanks. If today's
| LLMs output bad results, that's a reflection of the
| prompts. Garbage in, garbage out.
|
| This is unlikely to be true, for a few reasons: 1.
| Ambiguity makes it impossible to define "garbage", see
| prompt engineering. In fact, all human natural language
| output is garbage in the context of programming. 2. As
| the LLM fills in blanks, it must do so respecting the
| intention of the code, otherwise the intention of the
| code erodes, and its design is lost. 3. This would imply
| that LLMs have reached their peak and only improve by
| requiring less prompting by a user, this is simply not
| true as it is trivial to currently find problems an LLM
| cannot solve, regardless of the amount of prompting.
| motorest wrote:
| > Although this is true, it disregards the fact that
| prompting for tests takes time which may also be spent
| writing tests (...)
|
| No, not today at least. Some services like Copilot
| provide plugins that implement actions to automatically
| generate unit tests. This means that the unit test
| coverage you're describing is a right-click away.
|
| https://code.visualstudio.com/docs/copilot/copilot-smart-
| act...
|
| > (...) and it's not clear if poor-quality tests are free,
| in the sense that further development may cause these
| tests to fail for the wrong reasons, causing time spent
| debugging.
|
| That's not how automated tests work. If you have a green
| test that turns red when you touch some part of the code,
| this is the test working as expected, because your code
| change just introduced unexpected changes that violated
| an invariant.
|
| Also, today's LLMs are able to recreate all your unit
| tests from scratch.
|
| > This is unlikely to be true, for a couple reasons: 1.
| Ambiguity makes it impossible to define "garbage", see
| prompt engineering.
|
| "Ambiguity" is garbage in this context.
|
| > 2. As the LLM fills in blanks, it must do so
| respecting the intention of the code, otherwise the
| intention of the code erodes, and its design is lost.
|
| That's the responsibility of the developer, not the LLM.
| Garbage in, garbage out.
|
| > 3. This would imply that LLMs have reached their peak
| and only improve by requiring less prompting by a user,
| this is simply not true as it is trivial to currently
| find problems an LLM cannot solve, regardless of the
| amount of prompting.
|
| I don't think that point is relevant. The goal of a
| developer is still to meet the definition of done, not to
| tie their hands around their back and expect working code
| to just fall on their lap. Currently the main approach to
| vibe coding is to set the architecture, and lean on the
| LLM to progressively go from high level to low level
| details. Speaking from personal experience in vibecoding,
| LLMs are quite capable of delivering fully working apps
| with a single, detailed prompt. However, you get far more
| satisfactory results (i.e., the app reflects the same
| errors in judgement you'd make) if you just draft a
| skeleton and progressively fill in the blanks.
| vouwfietsman wrote:
| > That's not how automated tests work
|
| > today's LLMs are able to recreate all your unit tests from
| scratch.
|
| > That's the responsibility of the developer
|
| > LLMs are quite capable of delivering fully working apps with
| a single, detailed prompt
|
| You seem to be very resolute in positing generalizations;
| I think those are rarely true. I don't see a lot of
| benefit coming out of a discussion like this. Try reading
| my replies as if you agree with them, it will help you
| better understand my point of view, which will make your
| criticism more targeted, so you can avoid
| generalizations.
| UncleEntity wrote:
| From my testing the robots seem to 'understand' the code more
| than just learning how to do thing X in code from reading code
| about doing X. I've thrown research papers at them and they
| just 'get' what needs to be done to take the idea and
| implement it as a library or whatever. Or, what has become my
| favorite activity of late, give them some code and ask them
| how they would make it better -- then take that and split it
| up into simpler tasks because they get confused if you ask
| them to do too much at one time.
|
| As for debugging, they're not so good at that. Some debugging
| they can figure out but if they need to do something simple,
| like counting how far away item A is from item B, then I've
| found you pretty much have to do that for them. Don't get me
| wrong, they've found some pretty deep bugs I would have spent
| a bunch of time tracking down in gdb, so they aren't
| completely worthless, but I have definitely given up on the
| idea that I can just tell them the problem and they get to
| work fixing it.
|
| And, yeah, they're good at writing tests. I usually work on
| python C modules and my typical testing is playing with it in
| the repl but my current project is getting fully tested at
| the C level before I have gotten around to the python wrapper
| code.
|
| Overall it's been pretty productive using the robots: code is
| being written that I wouldn't have spent the time working on,
| unit testing is being used to make sure they don't break anything
| as the project progresses, and the codebase is being kept
| pretty sound because I know enough to see when they're going
| off the rails, as they often do.
| PeterStuer wrote:
| If you are working with natural language, it is by definition
| 'fuzzy' unless you reduce it to simple templates. So to
| evaluate whether an output is _semantically_ a reasonable
| answer to an input, e.g. where non-templated natural
| verbalization is needed, you need something that 'tests' the
| output, and that is not going to be purely 'logical'.
|
| Will that test be perfect? No. But what is the alternative?
| vouwfietsman wrote:
| Are you referring to the process of requirement engineering?
| Because although I agree it's a fuzzy natural language
| interface, behind the interface _should_ be (heavy should) a
| rigorously defined & designed system, where fuzzyness is
| eliminated. The LLMs need to work primarily with the rigorous
| definition, not the fuzzyness.
| PeterStuer wrote:
| It depends on the use case. e.g. Music generation like
| Suno. How do you rigorously and logically check the output?
| Or an automated copy-writing service?
|
| The tests should match the rigidity of the case. A mismatch
| in modality will lead to bad outcomes.
| vouwfietsman wrote:
| Aha! Like that. Yes that's interesting, the only other
| alternative would be manual classification of novel data,
| so extremely labour intensive. If an LLM is able to do
| the same classification automatically it opens up use
| cases that are otherwise indeed impossible.
| InkCanon wrote:
| It's a common idea, all the way back to Hoare logic. There was
| a time when people believed that, in the future, people would
| write specifications instead of code.
|
| The problem is that it takes several times more effort to verify
| code than to write it. This makes intuitive sense if you
| consider that the search space for the properties of code is
| much larger than the space of the code itself. Rice's theorem
| states that all non-trivial semantic properties of a program are
| undecidable.
| Smaug123 wrote:
| No, Rice's theorem states that there is no _general_
| procedure to take an _arbitrary_ program and decide
| nontrivial properties of its behaviour. As software
| engineers, though, we write _specific_ programs which have
| properties which can be decided, perhaps by reasoning
| _specific to the program_. (That's, like, the whole point of
| software engineering: you can't claim to have solved a
| problem if you wrote a program such that it's undecidable
| whether it solved the problem.)
|
| The "several times more effort to verify code" thing: I'm
| hoping the next few generations of LLMs will be able to do
| this properly! Imagine if you were writing in a dependently
| typed language, and you wrote your test as simply a theorem,
| and used a very competent LLM (perhaps with other program
| search techniques; who knows) to fill in the proof, which
| nobody will never read. Seems like a natural end state of the
| OP: more compute may relax the constraints on writing
| software whose behaviour is formally verifiable.
| lgiordano_notte wrote:
| LLM-based coding only really works when wrapped in structured
| prompts, constrained outputs, external checks etc. The systems
| that work well aren't just 'LLM take the wheel' architecture,
| they're carefully engineered pipelines. Most success stories
| are more about that scaffolding than the model itself.
| CivBase wrote:
| Does anyone provide a good breakdown of how much time/cost
| goes into the scaffolding vs how much is saved from not
| writing the code itself?
| lgiordano_notte wrote:
| A breakdown would be interesting. I can't give you hard
| numbers, but in our case scaffolding was most of the work.
| Getting the model to act reliably meant building structured
| abstractions, retries, output validation, context tracking,
| etc. Once that's in place you start saving time per task,
| but there's a cost up front.
| sigtstp wrote:
| I feel this makes some fundamental conceptual mistakes and is
| just riding the LLM wave.
|
| "Semantics" is literally behavior under execution. This is
| syntactical analysis by a stochastic language model. I know the
| NLP literature uses "semantics" to talk about representations but
| that is an assertion which is contested [1].
|
| Coming back to testing, this implicitly relies on the strong
| assumption of the LLM correctly associating the code (syntax)
| with assertions of properties under execution (semantic
| properties). This is a very risky assumption considering, once
| again, these things are stochastic in nature and cannot even
| guarantee syntactical correctness, let alone semantic. Being
| generous with the former, there is a track record of the latter
| often failing and producing subtle bugs [2][3][4][5]. Not to
| mention the observed effect of LLMs often being biased to "agree"
| with the premise presented to them.
|
| It also kind of misses the point of testing, which is the
| engineering (not automation) task of reasoning about code and
| doing QC (even if said tests are later run automatically, I'm
| talking about their conception). I feel it's a dangerous, albeit
| tempting, decision to relegate that to an LLM. Fuzzing, sure. But
| not assertions about program behavior.
|
| [1] A Primer in BERTology: What we know about how BERT works
| https://arxiv.org/abs/2002.12327 (Layers encode a mix of
| syntactic and semantic aspects of natural language, and it's
| problem-specific.)
|
| [2] Large Language Models of Code Fail at Completing Code with
| Potential Bugs https://arxiv.org/abs/2306.03438
|
| [3] SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World
| Freelance Software Engineering? https://arxiv.org/abs/2502.12115
| (best models unable to solve the majority of coding problems)
|
| [4] Evaluating the Code Quality of AI-Assisted Code Generation
| Tools: An Empirical Study on GitHub Copilot, Amazon
| CodeWhisperer, and ChatGPT https://arxiv.org/abs/2304.10778
|
| [5] Is Stack Overflow Obsolete? An Empirical Study of the
| Characteristics of ChatGPT Answers to Stack Overflow Questions
| https://arxiv.org/abs/2308.02312v4
|
| EDIT: Added references
| stoical1 wrote:
| Test driving a car by looking at it
| evanb wrote:
| > Beware of bugs in the above code; I have only proved it
| correct, not tried it.
|
| -- Donald Knuth, Notes on the van Emde Boas construction of
| priority deques: An instructive use of recursion (1977)
|
| https://www-cs-faculty.stanford.edu/~knuth/faq.html
| jonstewart wrote:
| Does this buy carbon offsets, too?
| lgiordano_notte wrote:
| Treating docstrings as the spec and asking an LLM to flag
| mismatches feels promising in theory but personally I'd be wary of
| overfitting to underspecified docs. Might be useful as a lint-
| like signal, but hard to see it replacing real tests just yet.
| bluGill wrote:
| If that is the only testing you do, I agree. However, testing
| that the code works as the docs say is valuable as well. The
| code will often do more, but it needs to do at least what the
| docs say.
| lgiordano_notte wrote:
| Agreed. Catching mismatches between doc and implementation is
| still valuable, just wouldn't want people to rely on it as a
| safety net when the docs themselves might be
| inaccurate/incomplete. As a complement to traditional tests
| though seems like a solid addition.
| JanSchu wrote:
| Interesting experiment. I like that you framed it as "tests that
| read the docs" rather than "AI will magically find bugs", because
| the former is exactly where LLMs shine: cross-checking natural-
| language intent with code.
|
| A couple of thoughts after playing with a similar idea in private
| repos:
|
| Token pressure is the real ceiling. Even moderately sized modules
| explode past 32k tokens once you inline dependencies and long
| docstrings. Chunking by call-graph depth helps, but at some point
| you need aggressive summarization or cropping, otherwise you burn
| GPU time on boilerplate.
|
| False confidence is worse than no test. LLMs love to pass your
| suite when the code and docstring are both wrong in the same way.
| I mitigated this by flipping the prompt: ask the model to propose
| three subtle, realistic bugs first, then check the implementation
| for each. The adversarial stance lowered the "looks good to me"
| rate.
|
| Structured outputs let you fuse with traditional tests. If the
| model says passed: false, emit a property-based test via
| Hypothesis that tries to hit the reasoning path it complained
| about. That way a human can reproduce the failure locally without
| a model in the loop.
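|
| Roughly what that fusion can look like (a sketch with made-up
| names; the model is assumed to return a structured verdict plus
| the inputs it thinks violate the documented property):
|
|     from dataclasses import dataclass
|     from typing import Callable
|
|     from hypothesis import given, strategies as st
|
|     @dataclass
|     class Verdict:
|         passed: bool
|         reason: str
|         suspect_inputs: list[int]  # values the model claims misbehave
|
|     def pin_down(verdict: Verdict,
|                  prop: Callable[[int], bool]) -> None:
|         """Turn a passed=False verdict into checks that run locally."""
|         if verdict.passed:
|             return
|         # Replay the model's suspected counterexamples directly...
|         for x in verdict.suspect_inputs:
|             assert prop(x), f"model claim reproduced: {verdict.reason}"
|         # ...then let Hypothesis search nearby inputs, no model needed.
|         @given(st.integers())
|         def search(n: int) -> None:
|             assert prop(n)
|         search()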
|
| Security review angle. LLM can spot obvious injection risks or
| unsafe eval calls even before SAST kicks in. Semantic tests that
| flag any use of exec, subprocess, or bare SQL are surprisingly
| helpful.
|
| CI ergonomics. Running suite on pull requests only for files that
| changed keeps latency and costs sane. We cache model responses
| keyed by file hash so re-runs are basically free.
|
| Overall I would not drop my pytest corpus, but I would keep an
| async "semantic diff" bot around to yell when a quick refactor
| drifts away from the docstring. That feels like the sweet spot
| today.
|
| P.S. If you want a local setup, Mistral-7B-Instruct via Ollama is
| plenty smart for doc/code mismatch checks and fits on a MacBook.
| jmull wrote:
| This is probably better thought of as AI-assisted code review
| rather than unit testing.
|
| Although you can automate running this test...
|
| 1. You may not want to blow up your token budget.
|
| 2. You probably want to manually review/use the results.
| brap wrote:
| Skepticism aside, I think this would have worked better as a
| linter rule. 100% coverage out of the box. Or opt-in with linter
| comments.
| gavmor wrote:
| If you don't try static typing first, I feel like you're leaving
| money on the table... on your way to burn a pile of money.
|
| Right? If you're looking to reduce bugs and errors... this is
| like putting a jetpack on a window-washer without even
| considering a carabiner harness.
___________________________________________________________________
(page generated 2025-05-05 23:02 UTC)