[HN Gopher] Semantic unit testing: test code without executing it
       ___________________________________________________________________
        
       Semantic unit testing: test code without executing it
        
       Author : alexmolas
       Score  : 70 points
       Date   : 2025-05-03 09:44 UTC (2 days ago)
        
 (HTM) web link (www.alexmolas.com)
 (TXT) w3m dump (www.alexmolas.com)
        
       | cjfd wrote:
       | Much better solution: don't write useless docstrings.
        
         | motorest wrote:
         | > Much better solution: don't write useless docstrings.
         | 
         | Actually writing the tests is far more effective, and doesn't
         | require fancy frameworks tightly coupled with external
         | services.
        
           | masklinn wrote:
            | Importantly there are all sorts of tests beyond trivial single-
           | value unit tests. Property testing (via hypothesis, in
           | python) for instance.
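            | 
            | For instance, a minimal property test, assuming a
            | hypothetical multiply(a, b) under test:
            | 
            |     from hypothesis import given, strategies as st
            | 
            |     from mymodule import multiply  # hypothetical import
            | 
            |     @given(st.integers(), st.integers())
            |     def test_multiply_matches_operator(a, b):
            |         # property: agree with the built-in * for all ints
            |         assert multiply(a, b) == a * b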
        
       | gnabgib wrote:
        | This seems to be your site @op... your CSS needs attention. On
        | a narrower screen (i.e. portrait) the text is enormous, and
        | worse, zooming out shrinks the quantity of words (increases the
        | font-size)... which is surely the opposite of what's expected?
        | It's basically unusable.
       | 
       | Your CSS seems to assume all portrait screens (whether 80" or 3")
       | deserve the same treatment.
        
       | stephantul wrote:
       | This is cool! I think that, in general, generating test cases
       | "offline" using an LLM and then running them using regular unit
       | testing also solves this particular issue.
       | 
       | It also might be more transparent and cheaper.
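        | 
        | A sketch of that offline step, assuming the openai Python
        | client (model name and file paths are made up); the generated
        | file is reviewed, committed, and then run by plain pytest with
        | no model in the loop:
        | 
        |     from openai import OpenAI
        | 
        |     client = OpenAI()  # expects OPENAI_API_KEY in the env
        | 
        |     def generate_tests(source: str) -> str:
        |         """One-off LLM call that drafts pytest tests."""
        |         resp = client.chat.completions.create(
        |             model="gpt-4o-mini",
        |             messages=[{
        |                 "role": "user",
        |                 "content": "Write pytest unit tests for:\n\n"
        |                            + source,
        |             }],
        |         )
        |         return resp.choices[0].message.content
        | 
        |     with open("test_generated.py", "w") as f:
        |         f.write(generate_tests(open("mymodule.py").read()))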
        
       | simianwords wrote:
       | I was a bit skeptical at first but I think this is a good idea.
        | Although I'm not convinced by the usage of the max_depth
        | parameter.
       | In real life you rarely know what type your dependencies are if
       | they are loaded at run time. This is kind of why we explicitly
       | mock our dependencies.
       | 
        | On a side note: I have wondered whether LLMs are particularly
       | good with functional languages. Imagine if your code entirely
       | consisted of just pure functions and no side effects. You pass
       | all parameters required and do not use static methods/variables
       | and no OOP concepts like inheritance. I imagine every program can
       | be converted in such a way, the tradeoff being human readability.
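        | 
        | A toy sketch of that conversion, with state threaded through
        | arguments instead of mutated in place:
        | 
        |     # stateful, OOP version
        |     class Counter:
        |         def __init__(self):
        |             self.total = 0
        | 
        |         def add(self, x):
        |             self.total += x  # side effect
        |             return self.total
        | 
        |     # pure version: state goes in, new state comes out
        |     def add_pure(total: int, x: int) -> int:
        |         return total + x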
        
       | jonathanlydall wrote:
       | If you're stuck with dynamically typed languages, then tests like
       | this can make a lot of sense.
       | 
       | On statically typed languages this happens for free at compile
       | time.
       | 
       | I've often heard proponents of dynamically typed languages say
        | how all the typing and boilerplate required by statically typed
       | languages feels like such a waste of time, and on a small enough
       | system maybe they are right.
       | 
       | But on any significant sized code bases, they pay dividends over
       | and over by saving you from having to make tests like this.
       | 
       | They also allow trivial refactoring that people using dynamically
       | typed languages wouldn't even consider due to the risk being so
       | high.
       | 
       | So keep this all in mind when you next choose your language for a
       | new project.
        
         | ngruhn wrote:
         | I think at least some people who say this think of Java-esque
         | type systems. And there I agree: it is a boilerplate nightmare.
        
         | motorest wrote:
         | > But on any significant sized code bases, they pay dividends
         | over and over by saving you from having to make tests like
         | this.
         | 
         | I firmly believe that the group of people who laud dynamically
          | typed languages as efficient time-savers that shed the drudge
          | work of typing, is tightly correlated with the group of
         | people who fail to establish any form of quality assurance or
         | testing, often using the same arguments to justify their
         | motivation.
        
           | 0xDEAFBEAD wrote:
           | The question I find interesting is whether type systems are
           | an efficient way to buy reliability relative to other ways to
           | purchase reliability, such as writing tests, doing code
           | review, or enforcing immutability.
           | 
           | Of course, some programmers just don't care about purchasing
           | reliability. Those are the ones who eschew type systems, and
           | tests, and produce unreliable software, about like you'd
            | expect. But for my purposes, this is beside the point.
        
             | bluGill wrote:
              | I find they are valuable. When you have a small program -
              | say 10k lines of code - you don't really need them.
              | However, when you are at more than 10 million lines of
              | code, types find a lot of little errors that would be
              | hard to write the correct tests for.
             | 
             | Most dynamically typed languages (all that I have worked
             | with) cannot catch that you misspelled a function name
             | until that function is called. If that misspelled function
             | is in an error path it would be very easy to never test it
             | until a customer hit the crash. Just having your function
             | names as a strong type that is checked by static analysis
              | (it need not be a compiler, though that is what everything
             | uses) is a big win. Checking the other arguments as well is
             | similarly helpful.
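              | 
              | A tiny Python illustration: the typo below only blows up
              | when the error path actually runs, while a static checker
              | (mypy or pyflakes) flags it immediately:
              | 
              |     def log_error(msg):
              |         print("ERROR:", msg)
              | 
              |     def save(data, path):
              |         try:
              |             with open(path, "w") as f:
              |                 f.write(data)
              |         except OSError as e:
              |             # NameError only when a write fails; mypy
              |             # reports: Name "logg_error" is not defined
              |             logg_error(str(e))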
        
           | globular-toast wrote:
           | Rubbish, in my experience. People who understand dynamic
           | languages know they need to write tests because it's the
           | _only_ thing asserting correctness. I could just as easily
            | say static people don't write tests because they think the
           | type system is enough. A type system is laughably bad at
           | asserting correct behaviour.
           | 
           | Personally I do use type hinting and mypy for much of my
           | Python code. But I'll most certainly omit it for throwaway
           | scripts and trivial stuff. I'm still not convinced it's
           | really worth the effort, though. I've had a few occasions
           | where the type checker has caught something important, but
           | most of the time it's an autist trap where you spend ages
           | making it correct "just because".
        
             | motorest wrote:
             | > Rubbish, in my experience. People who understand dynamic
             | languages know they need to write tests because it's the
             | only thing asserting correctness.
             | 
             | Tests don't assert correctness. At best they verify
             | specific invariants.
             | 
             | Statically typed languages lean on the compiler to
              | automatically verify some classes of invariants (e.g., can
              | I call this method on this object?).
             | 
             | With dynamically typed languages, you cannot lean on the
             | compiler to verify these invariants. Developers must fill
             | in this void by writing their own tests.
             | 
             | It's true that they "need" to do it to avoid some classes
             | of runtime errors that are only possible in dynamically
             | typed languages. But that's not the point. The point is
              | that those who complain that statically typed languages are
              | too cumbersome because they require boilerplate code for
              | things like compile-time type checking are also correlated
             | with the set of developers who fail to invest any time
             | adding or maintaining automated test suites, because of the
             | same reasons.
             | 
             | > I could just as easily say static people don't write
             | tests because they think the type system is enough. A type
             | system is laughably bad at asserting correct behaviour.
             | 
             | No, you can't. Developers who use statically typed
             | languages don't even think of type checking as a concern,
             | let alone a quality assurance issue.
        
               | bluGill wrote:
               | > Tests don't assert correctness. At best they verify
               | specific invariants.
               | 
               | Pedantically correct, but in practice those are close
               | enough to the same thing.
               | 
               | Even a formal proof cannot assert correctness -
               | requirements are often wrong. However in practice
               | requirements are close enough to correct that we can call
               | a formal proof also close enough.
        
         | 0xDEAFBEAD wrote:
         | Dan Luu looked at the literature and concluded that the
         | evidence for the benefit of types is underwhelming:
         | 
         | https://danluu.com/empirical-pl/
         | 
         | >But on any significant sized code bases, they pay dividends
         | over and over by saving you from having to make tests like
         | this.
         | 
         | OK, but if the alternative to tests is spending _more_ time on
         | a reliability method (type annotations) which buys you _less_
          | reliability compared to writing tests... it's hardly a win.
         | 
         | It fundamentally seems to me that there are plenty of bugs that
         | types can simply never catch. For example, if I have a "divide"
         | function and I accidentally swap the numerator and divisor
         | arguments, I can't think of any realistic type system which
         | will help me. Other methods for achieving reliability, like
         | writing tests or doing code review, don't seem to have the same
         | limitations.
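          | 
          | Even a single example-based test catches that swap, where the
          | types are identical and therefore useless (a toy sketch):
          | 
          |     def divide(numerator, divisor):
          |         return divisor / numerator  # bug: swapped args
          | 
          |     def test_divide():
          |         # both args are plain numbers, so no type checker
          |         # objects, but the assertion fails: 3/6 == 0.5
          |         assert divide(6, 3) == 2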
        
           | Smaug123 wrote:
           | > swap the numerator and divisor
           | 
           | Even Rust can express this; you don't need to get fancy.
           | Morally speaking, division takes a Num and a
           | std::num::NonZero<Num>.
        
         | UncleEntity wrote:
         | > On statically typed languages this happens for free at
         | compile time.
         | 
          | If only that were true, I wouldn't have become as good at
          | tracking down segfaults as I have over the years...
        
       | yuliyp wrote:
       | Did the author do any analysis of the effectiveness of their tool
       | on something beyond multiplication? Did they look to see if it
       | caught any bugs in any codebases? What's the false positive rate?
       | False negative?
       | 
        | As is, it's neat that they wrote some code to generate prompts
        | for an LLM, but there's no evidence that it actually works.
        
         | motorest wrote:
         | > Did the author do any analysis of the effectiveness of their
         | tool on something beyond multiplication? Did they look to see
         | if it caught any bugs in any codebases? What's the false
         | positive rate? False negative?
         | 
          | I would also add the concern of whether the tests are actually
         | deterministic.
         | 
         | The premise is also dubious, as docstring comments typically
         | hold only very high-level descriptions of the implementation
         | and often aren't even maintained. Writing a specification of
         | what a function is expected to do is what writing tests is all
         | about, and with LLMs these are a terse prompt away.
        
           | bluGill wrote:
            | Documentation should not be telling you how it is
           | implemented. It should tell you how and why to use the
           | function. Users who care about how it is implemented should
           | be reading the code not the comments. Users who need to
           | find/use a helper and get on with their feature shouldn't.
        
       | rollulus wrote:
       | I wonder if the random component of the LLM makes every test
       | flaky by definition.
        
       | dragonwriter wrote:
       | This is more of "LLM code review" than any kind of testing, and
       | calling it "testing" is just badly misleading.
        
         | anself wrote:
         | Agree, it's not testing. The problem is here: "In a typical
         | testing workflow, you write some basic tests to check the core
         | functionality. When a bug inevitably shows up--usually after
         | deployment--you go back and add more tests to cover it. This
         | process is reactive, time-consuming, and frankly, a bit
         | tedious."
         | 
         | This is exactly the problem that TDD solves. One of the most
          | compelling reasons for test-first is that "Running the code
         | in your head" does not actually work well in practice, leading
         | to the above-cited issues. This is just another variant of
         | "Running the code in your head" except an LLM is doing it.
         | Strong TDD practices (don't write any code without a test to
         | support it) will close those gaps. It may feel tedious at first
         | but the safety it creates will leave you never wanting to go
         | back.
         | 
         | Where this could be safe and useful: Find gaps in the test-set.
         | Places where the code was never written because there wasn't a
         | test to drive it out. This is one of the hardest parts of TDD,
         | and where LLMs could really help.
        
         | IshKebab wrote:
          | Yeah, this sounds like a good way to detect out-of-date
         | comments. I would have focused on that.
        
         | spiddy wrote:
          | This. Let's not confuse meanings. There are multiple ways to
          | improve the quality of code. Testing is one, code review is
          | another. This belongs to the latter.
        
       | noodletheworld wrote:
       | I don't think this is particularly terrible.
       | 
       | Broadly speaking, linters are good, and if you have a way of
       | linting implementation errors it's probably helpful.
       | 
       | I would say it's probably more helpful while you're coding than
        | at test/CI time because it will be, indubitably, flaky.
       | 
       | However, for a local developer workflow I can see a reasonable
       | value in being able to go:
       | 
       | Take every function in my code and scan it to figure out if you
       | think it's implemented correctly, and let me know if you spot
       | anything that looks weird / wrong / broken. Ideally only
       | functions that I've touched in my branch.
       | 
       | So... you know. Cool idea. I think it's overselling how useful it
       | is, but hey, smash your AI into every possible thing and
       | eventually you'll find a few modestly interesting uses for it.
       | 
       | This is probably a modestly interesting use case.
       | 
       | > suite allows you to run the tests asynchronously, and since the
       | main bottleneck is IO (all the computations happen in a GPU in
       | the cloud) it means that you can run your tests very fast. This
       | is a huge advantage in comparison to standard tests, which need
       | to be run sequentially.
       | 
       | uh... that said, saying that it's _fast_ to run your functions
       | through an LLM compared to, you know, just running tests, is a
       | little bit strange.
       | 
       | I'm certain your laptop will melt if you run 500 functions in
       | parallel through ollama gemma-3.
       | 
       | Running it over a network is, obviously, similarly insane.
       | 
        | This would also be enormously time-consuming and expensive to
        | use with a hosted LLM API.
       | 
       | The 'happy path' is probably having a plugin in your IDE that
       | scans the files you touch and then runs this in the background
       | when you make a commit somehow using a local LLM of sufficient
       | complexity it can be useful (gemma3 would probably work).
       | 
       | Kind of like having your tests in 'watch mode'; you don't expect
       | instant feedback, but some-time-after you've done something you
       | get a popup saying 'oh hey, are you sure you meant to return a
       | string here..?'
       | 
       | Maybe it would just be annoying. You'd have to build it out
       | properly and see. /shrug
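        | 
        | A rough sketch of that background check, assuming the ollama
        | Python client with a locally pulled gemma3 (the git plumbing
        | and prompt are made up):
        | 
        |     import subprocess
        | 
        |     import ollama
        | 
        |     changed = subprocess.check_output(
        |         ["git", "diff", "--name-only", "HEAD"], text=True
        |     ).splitlines()
        | 
        |     for path in changed:
        |         if not path.endswith(".py"):
        |             continue
        |         resp = ollama.chat(model="gemma3", messages=[{
        |             "role": "user",
        |             "content": "Point out anything that looks wrong"
        |                        " or broken in this file:\n\n"
        |                        + open(path).read(),
        |         }])
        |         print(path, "->", resp["message"]["content"])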
       | 
       | I think it's not implausible though, that you could see something
       | _vaguely like this_ that was generally useful.
       | 
        | Probably what you see in this specific implementation is only a
        | precursor of something actually useful, though. Not really
        | useful on its own, in its current form, imo.
        
       | RainyDayTmrw wrote:
       | I'm skeptical. Most of us maintaining medium sized codebases or
       | larger are constantly fighting nondeterminism in the form of
       | flaky tests. I can't imagine choosing a design that starts with
       | nondeterminism baked in.
       | 
        | And if you're really dead-set on paying in nondeterminism to get
       | more coverage, property-based testing has existed for a long time
       | and has a comparatively solid track record.
        
         | mrkeen wrote:
         | Couldn't put it better myself.
         | 
         | I have the toughest time trying to communicate why f(x) should
         | equal f(x) in the general case.
        
         | Garlef wrote:
         | Hm... I think you have a good point.
         | 
         | Maybe the non-determinism can be reduced by caching: Just
         | reevaluate the spec if the code actually changes?
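          | 
          | A sketch of that caching, keyed on a hash of the function's
          | source (the evaluate callback stands in for the LLM call):
          | 
          |     import hashlib, inspect, json, pathlib
          | 
          |     CACHE = pathlib.Path(".semantic_cache.json")
          | 
          |     def cached_verdict(func, evaluate):
          |         src = inspect.getsource(func)
          |         key = hashlib.sha256(src.encode()).hexdigest()
          |         cache = (json.loads(CACHE.read_text())
          |                  if CACHE.exists() else {})
          |         if key not in cache:
          |             cache[key] = evaluate(src)  # the expensive call
          |             CACHE.write_text(json.dumps(cache))
          |         return cache[key]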
         | 
         | I think there are also other problems (inlining a verbal
         | description makes the codebase verbose, writing a precise, non-
         | ambiguous verbal description might be more work than writing
         | unit tests)
        
           | carlmr wrote:
           | >Maybe the non-determinism can be reduced by caching: Just
           | reevaluate the spec if the code actually changes?
           | 
           | That would be good anyway to keep the costs reasonable.
        
         | IshKebab wrote:
         | I agree. I want this as a code review tool to check if people
         | forgot to update comments - "it looks like this now adds
         | instead of multiplies, but the comment says otherwise; did you
         | forget to update it?".
         | 
         | Seems of dubious value as unit tests. LLMs don't seem to be
         | quite smart enough for that in my experience, unless your bugs
         | are _really_ as trivial as adding instead of multiplying, in
         | which case god help you.
        
         | Davidbrcz wrote:
          | Many good and prolific approaches are non-deterministic, such
          | as fuzzing or property-based testing.
        
       | masklinn wrote:
       | > But here's the catch: you're missing some edge cases. What
       | about negative inputs?
       | 
       | The docstring literally says it only works with positive
       | integers, and the LLM is supposed to follow the docstring (per
       | previous assertions).
       | 
       | > The problem is that traditional tests can only cover a narrow
       | slice of your function's behavior.
       | 
       | Property tests? Fuzzers? Symbolic execution?
       | 
       | > Just because a high percentage of tests pass doesn't mean your
       | code is bug-free.
       | 
       | Neither does this thing. If you want your code to be bug-free
       | what you're looking for is a proof assistant not vibe-reviewing.
       | 
       | Also
       | 
       | > One of the reasons to use suite is its seamless integration
       | with pytest.
       | 
       | Exposing a predicate is not "seamless integration with pytest",
       | it's just exposing a predicate.
        
       | cerpins wrote:
       | It sounds like it might be a good use case for testing
       | documentation - verifying whether what documentation describes is
       | actually in accordance with the code, and then you can act on it.
       | With that in mind, it's also probably pointless to re-run if
       | relevant code or documentation hasn't changed.
        
       | vouwfietsman wrote:
       | Maybe someone can help me out here:
       | 
       | I always get the feeling that fundamentally our software should
       | be built on a foundation of sound logic and reasoning. That
       | doesn't mean that we cannot use LLMs to build that software, but
       | it does mean that in the end every line of code must be validated
        | to make sure there are no issues injected by the LLM tools that
       | inherently lack logic and reasoning, or at least such validation
       | must be on par with human authored code + review. Because of
       | this, the validation cannot be done by an LLM, as it would just
       | compound the problem.
       | 
       | Unless we get a drastic change in the level of error detection
       | and self-validation that can be done by an LLM, this remains a
       | problem for the foreseeable future.
       | 
       | How is it then that people build tooling where the LLM validates
       | the code they write? Or claim 2x speedups for code written by
       | LLMs? Is there some kind of false positive/negative tradeoff I'm
       | missing that allows people to extract robust software from an
       | inherently not-robust generation process?
       | 
       | I'm not talking about search and documentation, where I'm already
       | seeing a lot of benefit from LLMs today, because between the LLM
       | output and the code is me, sanity checking and filtering
       | everything. What I'm asking about is the: "LLM take the wheel!"
       | type engineering.
        
         | darawk wrote:
         | This particular person seems to be using LLMs for code review,
         | not generation. I agree that the problem is compounded if you
         | use an LLM (esp. the same model) on both sides. However, it
         | seems reasonable and useful to use it as an _adjunct_ to other
         | forms of testing, though not necessarily a replacement for
         | them. Though again, the degree to which it can be a replacement
         | is a function of the level of the technology, and it is
         | currently at the level where it can probably replace _some_
          | traditional testing methods, though it's hard to know which,
         | ex-ante.
         | 
         | edit: of course, maybe that means we need a meta-suite, that
         | uses a different LLM to tell you which tests you should write
         | yourself and which tests you can safely leave to LLM review.
        
           | vouwfietsman wrote:
           | Indeed the idea of a meta LLM, or some sort of clear
           | distinction between manual and automated-but-questionable
           | tests makes sense. So what bothers me is that does not seem
           | to be the approach most people take: code produced by the LLM
            | is treated the same as code produced by human authors.
        
         | motorest wrote:
         | > That doesn't mean that we cannot use LLMs to build that
         | software, but it does mean that in the end every line of code
         | must be validated to make sure there's no issues injected by
         | the LLM tools that inherently (...)
         | 
         | The problem with your assertion is that it fails to understand
         | that today's software, where every single line of code was
         | typed in by real flesh-and-bone humans, already fails to have
          | adequate test coverage, let alone be validated.
         | 
         | The main problem with output from LLMs is that they were
         | trained with the code written by humans, and thus they
         | accurately reflect the quality of the code that's found in the
         | wild. Consequently, your line of reasoning actually criticizes
          | LLMs for outputting the same unreliable code that people write.
         | 
         | Counterintuitively, LLMs end up generating a better output
         | because at least they are designed to simplify the task of
         | automatically generating tests.
        
           | vouwfietsman wrote:
           | Right but by your reasoning it would make sense to use LLMs
           | only to augment an incomplete but rigorous testing process,
           | or to otherwise elevate below average code.
           | 
           | My issue is not necessarily with the quality of the code, but
           | rather with the intention of the code, which is much more
           | important: a good design without tests is more durable than a
           | bad design with tests.
        
             | motorest wrote:
             | > Right but by your reasoning it would make sense to use
             | LLMs only to augment an incomplete but rigorous testing
             | process, or to otherwise elevate below average code.
             | 
             | No. It makes sense to use LLMs to generate tests. Even if
             | their output matches the worst output the average human can
             | write by hand, having any coverage whatsoever already
             | raises the bar from where the average human output is.
             | 
             | > My issue is not necessarily with the quality of the code,
             | but rather with the intention of the code (...)
             | 
             | That's not the LLM's responsibility. Humans specify what
             | they want and LLMs fill in the blanks. If today's LLMs
             | output bad results, that's a reflection of the prompts.
             | Garbage in, garbage out.
        
               | vouwfietsman wrote:
               | > No. It makes sense to use LLMs to generate tests. Even
               | if their output matches the worst output the average
               | human can write by hand, having any coverage whatsoever
               | already raises the bar from where the average human
               | output is.
               | 
               | Although this is true, it disregards the fact that
               | prompting for tests takes time which may also be spent
                | writing tests, and it's not clear if poor-quality tests
               | are free, in the sense that further development may cause
               | these tests to fail for the wrong reasons, causing time
               | spent debugging. This is why I used the word "augment":
               | these tests are clearly not the same quality as manual
               | tests, and should be considered separately from manual
               | tests. In other words, they may serve to elevate below
               | average code or augment manual tests, but not more than
               | that. Again, I'm not saying it makes no sense to do this.
               | 
               | > That's not the LLM's responsibility. Humans specify
               | what they want and LLMs fill in the blanks. If today's
               | LLMs output bad results, that's a reflection of the
               | prompts. Garbage in, garbage out.
               | 
               | This is unlikely to be true, for a couple reasons: 1.
               | Ambiguity makes it impossible to define "garbage", see
               | prompt engineering. In fact, all human natural language
               | output is garbage in the context of programming. 2. As
               | the LLM fills in blanks, it must do so respecting the
               | intention of the code, otherwise the intention of the
               | code erodes, and its design is lost. 3. This would imply
               | that LLMs have reached their peak and only improve by
               | requiring less prompting by a user, this is simply not
               | true as it is trivial to currently find problems an LLM
               | cannot solve, regardless of the amount of prompting.
        
               | motorest wrote:
               | > Although this is true, it disregards the fact that
               | prompting for tests takes time which may also be spent
               | writing tests (...)
               | 
               | No, not today at least. Some services like Copilot
               | provide plugins that implement actions to automatically
               | generate unit tests. This means that the unit test
               | coverage you're describing is a right-click away.
               | 
               | https://code.visualstudio.com/docs/copilot/copilot-smart-
               | act...
               | 
                | > (...) and it's not clear if poor-quality tests are free,
               | in the sense that further development may cause these
               | tests to fail for the wrong reasons, causing time spent
               | debugging.
               | 
               | That's not how automated tests work. If you have a green
               | test that turns red when you touch some part of the code,
               | this is the test working as expected, because your code
               | change just introduced unexpected changes that violated
               | an invariant.
               | 
               | Also, today's LLMs are able to recreate all your unit
               | tests from scratch.
               | 
               | > This is unlikely to be true, for a couple reasons: 1.
               | Ambiguity makes it impossible to define "garbage", see
               | prompt engineering.
               | 
               | "Ambiguity" is garbage in this context.
               | 
                | > 2. As the LLM fills in blanks, it must do so
               | respecting the intention of the code, otherwise the
               | intention of the code erodes, and its design is lost.
               | 
               | That's the responsibility of the developer, not the LLM.
               | Garbage in, garbage out.
               | 
                | > 3. This would imply that LLMs have reached their peak
               | and only improve by requiring less prompting by a user,
               | this is simply not true as it is trivial to currently
               | find problems an LLM cannot solve, regardless of the
               | amount of prompting.
               | 
               | I don't think that point is relevant. The goal of a
               | developer is still to meet the definition of done, not to
                | tie their hands behind their back and expect working code
               | to just fall on their lap. Currently the main approach to
               | vibe coding is to set the architecture, and lean on the
               | LLM to progressively go from high level to low level
               | details. Speaking from personal experience in vibecoding,
               | LLMs are quite capable of delivering fully working apps
               | with a single, detailed prompt. However, you get far more
               | satisfactory results (i.e., the app reflects the same
               | errors in judgement you'd make) if you just draft a
               | skeleton and progressively fill in the blanks.
        
               | vouwfietsman wrote:
                | > That's not how automated tests work
                | 
                | > today's LLMs are able to recreate all your unit tests
                | from scratch.
                | 
                | > That's the responsibility of the developer
                | 
                | > LLMs are quite capable of delivering fully working
                | apps with a single, detailed prompt
               | 
                | You seem to be very resolute in positing generalizations;
               | I think those are rarely true. I don't see a lot of
               | benefit coming out of a discussion like this. Try reading
               | my replies as if you agree with them, it will help you
               | better understand my point of view, which will make your
               | criticism more targeted, so you can avoid
               | generalizations.
        
           | UncleEntity wrote:
            | From my testing the robots seem to 'understand' the code more
            | than just learning how to do thing X from reading code that
            | does X. I've thrown research papers at them and they
           | just 'get' what needs to be done to take the idea and
           | implement it as a library or whatever. Or, what has become my
           | favorite activity of late, give them some code and ask them
           | how they would make it better -- then take that and split it
            | up into simpler tasks because they get confused if you ask
           | them to do too much at one time.
           | 
           | As for debugging, they're not so good at that. Some debugging
           | they can figure out but if they need to do something simple,
            | like counting how far away item A is from item B, I've
            | found you pretty much have to do that for them. Don't get me
            | wrong, they've found some pretty deep bugs I would have spent
           | a bunch of time tracking down in gdb, so they aren't
           | completely worthless but I have definitely given up on the
           | idea that I can just tell them the problem and they get to
           | work fixing it though.
           | 
           | And, yeah, they're good at writing tests. I usually work on
           | python C modules and my typical testing is playing with it in
            | the repl, but my current project is getting fully tested at
           | the C level before I have gotten around to the python wrapper
           | code.
           | 
            | Overall it's been pretty productive using the robots, code is
           | being written I wouldn't have spent the time working on, unit
           | testing is being used to make sure they don't break anything
           | as the project progresses and the codebase is being kept
           | pretty sound because I know enough to see when they're going
           | off the rails as they often do.
        
         | PeterStuer wrote:
         | If you are working with natural language, it is by definition
         | 'fuzzy' unless you reduce it to simple templates. So to
          | evaluate whether an output is, _semantically_, a reasonable
          | answer to an input where non-templated natural verbalization
          | is needed, you need something that 'tests' the
         | output, and that is not going to be purely 'logical'.
         | 
         | Will that test be perfect? No. But what is the alternative?
        
           | vouwfietsman wrote:
           | Are you referring to the process of requirement engineering?
            | Because although I agree it's a fuzzy natural language
           | interface, behind the interface _should_ be (heavy should) a
           | rigorously defined  & designed system, where fuzzyness is
           | eliminated. The LLMs need to work primarily with the rigorous
           | definition, not the fuzzyness.
        
             | PeterStuer wrote:
             | It depends on the use case. e.g. Music generation like
             | Suno. How do you rigorously and logically check the output?
             | Or an automated copy-writing service?
             | 
             | The tests should match the rigidity of the case. A mismatch
             | in modality will lead to bad outcomes.
        
               | vouwfietsman wrote:
                | Aha! Like that. Yes, that's interesting; the only other
                | alternative would be manual classification of novel
                | data, which is extremely labour-intensive. If an LLM is
                | able to do the same classification automatically, it
                | opens up use cases that are otherwise indeed impossible.
        
         | InkCanon wrote:
          | It's a common idea, going all the way back to Hoare logic.
          | There was a time when people believed that, in the future,
          | people would write specifications instead of code.
          | 
          | The problem is that it takes several times more effort to
          | verify code than to write it. This makes intuitive sense if
          | you consider that the search space for the properties of code
          | is much larger than the space of the code itself. Rice's
          | theorem states that all non-trivial semantic properties of a
          | program are undecidable.
        
           | Smaug123 wrote:
           | No, Rice's theorem states that there is no _general_
           | procedure to take an _arbitrary_ program and decide
           | nontrivial properties of its behaviour. As software
           | engineers, though, we write _specific_ programs which have
           | properties which can be decided, perhaps by reasoning
            | _specific to the program_. (That's, like, the whole point of
           | software engineering: you can't claim to have solved a
           | problem if you wrote a program such that it's undecidable
           | whether it solved the problem.)
           | 
           | The "several times more effort to verify code" thing: I'm
           | hoping the next few generations of LLMs will be able to do
           | this properly! Imagine if you were writing in a dependently
           | typed language, and you wrote your test as simply a theorem,
           | and used a very competent LLM (perhaps with other program
           | search techniques; who knows) to fill in the proof, which
            | nobody need ever read. Seems like a natural end state of the
           | OP: more compute may relax the constraints on writing
           | software whose behaviour is formally verifiable.
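            | 
            | In Lean 4 that "test as a theorem" idea looks roughly like
            | this (multiply is a stand-in; the proof after := by is the
            | part an LLM could fill in):
            | 
            |     def multiply (a b : Nat) : Nat := a * b
            | 
            |     -- the "unit test", stated once for all inputs
            |     theorem multiply_comm (a b : Nat) :
            |         multiply a b = multiply b a := by
            |       unfold multiply
            |       exact Nat.mul_comm a b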
        
         | lgiordano_notte wrote:
         | LLM-based coding only really works when wrapped in structured
         | prompts, constrained outputs, external checks etc. The systems
         | that work well aren't just 'LLM take the wheel' architecture,
         | they're carefully engineered pipelines. Most success stories
         | are more about that scaffolding than the model itself.
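          | 
          | A stripped-down sketch of the retry-plus-validation part of
          | that scaffolding (call_llm stands in for any model client):
          | 
          |     import json
          | 
          |     def ask(call_llm, prompt, validate, tries=3):
          |         for _ in range(tries):
          |             raw = call_llm(prompt)
          |             try:
          |                 out = json.loads(raw)
          |             except json.JSONDecodeError:
          |                 continue  # malformed output, ask again
          |             if validate(out):
          |                 return out
          |         raise RuntimeError("no valid output after retries")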
        
           | CivBase wrote:
           | Does anyone provide a good breakdown of how much time/cost
           | goes into the scaffolding vs how much is saved from not
           | writing the code itself?
        
             | lgiordano_notte wrote:
             | A breakdown would be interesting. I can't give you hard
             | numbers, but in our case scaffolding was most of the work.
             | Getting the model to act reliably meant building structured
             | abstractions, retries, output validation, context tracking,
             | etc. Once that's in place you start saving time per task,
             | but there's a cost up front.
        
       | sigtstp wrote:
       | I feel this makes some fundamental conceptual mistakes and is
       | just riding the LLM wave.
       | 
       | "Semantics" is literally behavior under execution. This is
       | syntactical analysis by a stochastic language model. I know the
       | NLP literature uses "semantics" to talk about representations but
       | that is an assertion which is contested [1].
       | 
       | Coming back to testing, this implicitly relies on the strong
       | assumption of the LLM correctly associating the code (syntax)
       | with assertions of properties under execution (semantic
       | properties). This is a very risky assumption considering, once
       | again, these things are stochastic in nature and cannot even
       | guarantee syntactical correctness, let alone semantic. Being
       | generous with the former, there is a track record of the latter
       | often failing and producing subtle bugs [2][3][4][5]. Not to
       | mention the observed effect of LLMs often being biased to "agree"
       | with the premise presented to them.
       | 
       | It also kind of misses the point of testing, which is the
       | engineering (not automation) task of reasoning about code and
       | doing QC (even if said tests are later run automatically, I'm
       | talking about their conception). I feel it's a dangerous, albeit
       | tempting, decision to relegate that to an LLM. Fuzzing, sure. But
       | not assertions about program behavior.
       | 
       | [1] A Primer in BERTology: What we know about how BERT works
       | https://arxiv.org/abs/2002.12327 (Layers encode a mix of
       | syntactic and semantic aspects of natural language, and it's
       | problem-specific.)
       | 
       | [2] Large Language Models of Code Fail at Completing Code with
       | Potential Bugs https://arxiv.org/abs/2306.03438
       | 
       | [3] SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World
       | Freelance Software Engineering? https://arxiv.org/abs/2502.12115
       | (best models unable to solve the majority of coding problems)
       | 
       | [4] Evaluating the Code Quality of AI-Assisted Code Generation
       | Tools: An Empirical Study on GitHub Copilot, Amazon
       | CodeWhisperer, and ChatGPT https://arxiv.org/abs/2304.10778
       | 
       | [5] Is Stack Overflow Obsolete? An Empirical Study of the
       | Characteristics of ChatGPT Answers to Stack Overflow Questions
       | https://arxiv.org/abs/2308.02312v4
       | 
       | EDIT: Added references
        
       | stoical1 wrote:
       | Test driving a car by looking at it
        
       | evanb wrote:
       | > Beware of bugs in the above code; I have only proved it
       | correct, not tried it.
       | 
       | -- Donald Knuth, Notes on the van Emde Boas construction of
       | priority deques: An instructive use of recursion (1977)
       | 
       | https://www-cs-faculty.stanford.edu/~knuth/faq.html
        
       | jonstewart wrote:
       | Does this buy carbon offsets, too?
        
       | lgiordano_notte wrote:
       | Treating docstrings as the spec and asking an LLM to flag
        | mismatches feels promising in theory, but personally I'd be
        | wary of overfitting to underspecified docs. Might be useful as
        | a lint-like signal, but hard to see it replacing real tests
        | just yet.
        
         | bluGill wrote:
          | If that is the only testing you do, I agree. However, testing
         | that the code works as the docs say is valuable as well. The
         | code often will do more, but it needs to do at least what the
         | docs say.
        
           | lgiordano_notte wrote:
           | Agreed. Catching mismatches between doc and implementation is
           | still valuable, just wouldn't want people to rely on it as a
           | safety net when the docs themselves might be
           | inaccurate/incomplete. As a complement to traditional tests
           | though seems like a solid addition.
        
       | JanSchu wrote:
       | Interesting experiment. I like that you framed it as "tests that
       | read the docs" rather than "AI will magically find bugs", because
       | the former is exactly where LLMs shine: cross-checking natural-
       | language intent with code.
       | 
       | A couple of thoughts after playing with a similar idea in private
       | repos:
       | 
       | Token pressure is the real ceiling. Even moderately sized modules
       | explode past 32k tokens once you inline dependencies and long
       | docstrings. Chunking by call-graph depth helps, but at some point
       | you need aggressive summarization or cropping, otherwise you burn
       | GPU time on boilerplate.
       | 
       | False confidence is worse than no test. LLMs love to pass your
       | suite when the code and docstring are both wrong in the same way.
       | I mitigated this by flipping the prompt: ask the model to propose
       | three subtle, realistic bugs first, then check the implementation
       | for each. The adversarial stance lowered the "looks good to me"
       | rate.
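        | 
        | The flipped prompt, roughly (call_llm stands in for whatever
        | model client you use):
        | 
        |     PROMPT = (
        |         "Propose three subtle, realistic bugs this function "
        |         "could plausibly contain. Then check whether each is "
        |         "actually present, quoting the lines involved. "
        |         "Answer PASS only if none of them are.\n\n{source}"
        |     )
        | 
        |     def review(call_llm, source):
        |         return call_llm(PROMPT.format(source=source))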
       | 
       | Structured outputs let you fuse with traditional tests. If the
       | model says passed: false, emit a property-based test via
       | Hypothesis that tries to hit the reasoning path it complained
       | about. That way a human can reproduce the failure locally without
       | a model in the loop.
       | 
        | Security review angle. An LLM can spot obvious injection risks
        | or unsafe eval calls even before SAST kicks in. Semantic tests
        | that flag any use of exec, subprocess, or bare SQL are
        | surprisingly helpful.
       | 
       | CI ergonomics. Running suite on pull requests only for files that
       | changed keeps latency and costs sane. We cache model responses
       | keyed by file hash so re-runs are basically free.
       | 
       | Overall I would not drop my pytest corpus, but I would keep an
       | async "semantic diff" bot around to yell when a quick refactor
       | drifts away from the docstring. That feels like the sweet spot
       | today.
       | 
       | P.S. If you want a local setup, Mistral-7B-Instruct via Ollama is
        | plenty smart for doc/code mismatch checks and fits on a MacBook.
        
       | jmull wrote:
       | This is probably better thought of as AI-assisted code review
       | rather than unit testing.
       | 
       | Although you can automate running this test...
       | 
       | 1. You may not want to blow up your token budget.
       | 
       | 2. You probably want to manually review/use the results.
        
       | brap wrote:
       | Skepticism aside, I think this would have worked better as a
       | linter rule. 100% coverage out of the box. Or opt-in with linter
       | comments.
        
       | gavmor wrote:
       | If you don't try static typing, first, I feel like you're leaving
       | money on the table... on your way to burn a pile of money.
       | 
       | Right? If you're looking to reduce bugs and errors... this is
       | like putting a jetpack on a window-washer without even
       | considering a carabiner harness.
        
       ___________________________________________________________________
       (page generated 2025-05-05 23:02 UTC)