[HN Gopher] Empirical Study of Test Generation with LLMs
       ___________________________________________________________________
        
       Empirical Study of Test Generation with LLMs
        
       Author : nickpsecurity
       Score  : 31 points
       Date   : 2024-12-28 16:10 UTC (2 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | nickpsecurity wrote:
       | Here's the EvoSuite that they refer to:
       | 
       | https://www.evosuite.org/evosuite/
        
       | bediger4000 wrote:
       | Doesn't seem to compare to human-generated tests. I'm guessing
       | that comparison is none too favorable.
        
         | nickpsecurity wrote:
         | I think humans are probably still better (a) when they write
         | tests and (b) when they know how to. Neither is true for a lot
         | of programmers, but might happen if automated cheaply.
         | 
         | I also think traditional tools are better for this. The test
         | generation methods include path-based, combinatorial, concolic,
         | and adaptive fuzzing. Fire-and-forget tools that do each one,
         | suppressing duplicates, would be helpful.
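         | 
         | As a rough sketch of the combinatorial idea (toy function and
         | values of my own, not any particular tool's output):
         | 
         |     import itertools
         |     import unittest
         | 
         |     def clamp(value, low, high):
         |         """Toy function under test: clamp value into [low, high]."""
         |         return max(low, min(value, high))
         | 
         |     class ClampCombinatorialTest(unittest.TestCase):
         |         def test_all_combinations(self):
         |             # Every combination of representative inputs; a real
         |             # tool would also suppress redundant cases.
         |             values, lows, highs = [-5, 0, 5], [-1, 0], [1, 10]
         |             for v, lo, hi in itertools.product(values, lows, highs):
         |                 result = clamp(v, lo, hi)
         |                 self.assertGreaterEqual(result, lo)
         |                 self.assertLessEqual(result, hi)
         | 
         |     if __name__ == "__main__":
         |         unittest.main()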
         | 
         | What would be easier to train developers on are contracts (or
         | annotations). Then, add test generation either from those or
         | aided by static analysis that leverages them. Then, software
         | that analyzes the code, annotates what it can, and then asks
         | the user to fill in the holes. The LLMs could turn their
         | answers into formal contracts or specs.
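         | 
         | A minimal sketch of what I mean by a contract (plain asserts
         | and hypothetical names, rather than a specific contract
         | library):
         | 
         |     def withdraw(balance: int, amount: int) -> int:
         |         """Contract: requires 0 < amount <= balance;
         |         ensures result == balance - amount and result >= 0."""
         |         assert 0 < amount <= balance, "precondition violated"
         |         result = balance - amount
         |         assert result >= 0, "postcondition violated"
         |         return result
         | 
         |     # Tests a generator could derive from the contract alone:
         |     # withdraw(100, 40) == 60        (typical case)
         |     # withdraw(100, 100) == 0        (boundary)
         |     # withdraw(10, 20) must raise    (precondition violated)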
         | 
         | That's how I see it happening. Maybe add patch generation into
         | that, too. At least two tools that predate LLMs already find
         | errors and suggest fixes.
        
       | jitl wrote:
       | My Cursor workflow for getting tests is to make the test file,
       | import the code under test, and then type cmd-k, "unit tests for
       | <class>", enter. I add further cmd-k prompts for method tests
       | and cases as needed.
       | 
       | Pretty basic, would adding more shenanigans get me better
       | results?
       | 
       | I try to write doc comments for methods with contracts, and
       | Cursor/Claude seems to do a good job reading and testing those.
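       | 
       | A simplified sketch of the kind of doc comment I mean, plus the
       | sort of test that comes back (made-up function, not my actual
       | code):
       | 
       |     import pytest
       | 
       |     def reserve(stock: dict, sku: str, qty: int) -> int:
       |         """Reserve qty units of sku from stock.
       | 
       |         Contract: qty > 0, sku must already be in stock, and the
       |         remaining count is never negative; returns the remainder.
       |         """
       |         if qty <= 0:
       |             raise ValueError("qty must be positive")
       |         remaining = stock[sku] - qty
       |         if remaining < 0:
       |             raise ValueError("insufficient stock")
       |         stock[sku] = remaining
       |         return remaining
       | 
       |     # Roughly what the generated tests look like:
       |     def test_reserve_happy_path():
       |         assert reserve({"abc": 3}, "abc", 2) == 1
       | 
       |     def test_reserve_rejects_overdraw():
       |         with pytest.raises(ValueError):
       |             reserve({"abc": 1}, "abc", 2)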
        
         | peterldowns wrote:
         | What value are the tests giving you? Asking truly naively, I
         | haven't tried generating tests for code before, and I'm
         | skeptical that it would be useful for me.
        
           | jitl wrote:
           | The tests provide the same value as usual: early detection of
           | coding mistakes, a (somewhat) formal specification of
           | contracts, assurance the code does what it's supposed to do
           | and continues to do what it's supposed to do as the codebase
           | evolves over time.
        
         | arnvald wrote:
         | I tried that with Cody (from Sourcegraph) using probably o1
         | model and I struggled. I had a long function with a number of
         | conditions and early returns.
         | 
         | At first Cody generated a single test case. Then I asked it to
         | consider all possible scenarios in that function (around 10);
         | it generated 5 cases, 3 of which were correct and 2 of which
         | were made up (it tried to use nonexistent enum values).
         | 
         | In the end I used it to generate some mocks and a few minor
         | things but then I copy pasted the test myself a few times and
         | changed the values manually.
         | 
         | Did it save me some time in the end? Possibly, but it also
         | caused a lot of frustration. I still hope it gets better,
         | because generating tests should be a great use case for LLMs.
        
       | fovc wrote:
       | I still don't understand how people are getting value out of AI
       | coders. I've tried really hard and the commits produced are just
       | a step up from garbage. Writing code from scratch is generally
       | decent. But after a few rounds of edits the assistant just starts
       | piling conditionals into existing functions until it's a rat's
       | nest 4 layers deep and 100+ lines long. The other day it got into
       | a loop trying to resolve a type error, where it would make a
       | change, then revert it, then make it again.
       | 
       | ETA: Sorry, I forgot about relevance in my rant! The one area
       | where I've found the AIs helpful is enumerating and then creating
       | test cases.
        
         | godelski wrote:
         | I have a simple answer for you: most people write garbage
         | code[00]. I know this because I write garbage code but usually
         | less garbage.
         | 
         | A little background...
         | 
         | I got my undergrad in physics (where I fell in love with math),
         | spent some years working, and got really interested in coding
         | and especially ML (especially the math). So I went to grad
         | school. Unsurprisingly, I had major impostor syndrome being
         | surrounded by CS people[0], so I spent a huge amount of time
         | trying to fill in the gap. The real problem was that many of my
         | friends were PL people, so they're math-heavy and great
         | programmers. But after teaching a bunch of high-level CS
         | classes I realized I wasn't behind. After having to fix a lot
         | of autograders made by my peers, I didn't feel so behind. When
         | my lab grew and I got to meet a lot more ML people, I felt
         | ahead, and confused. I realized the problem: I was trying to be
         | a physicist in CS. Trying to understand things at very
         | fundamental levels and using that to build up, not feeling like
         | I "knew" a topic until I knew that chain. I realized people
         | were just saying they knew at a different threshold.
         | 
         | Back to ML:
         | 
         | Working and researching in ML, I've noticed one common flaw.
         | People are ignoring details. I thought HuggingFace would be "my
         | savior" where people would see that their generation outputs
         | weren't nearly the quality you see in papers. But this didn't
         | happen. We cherry-picked results and ignored failures. It feels
         | like writing proofs, but people only look at the last line (I'd
         | argue this is analogous to code! It's about so much more than
         | the output! The steps are what matter).
         | 
         | So there are two camps of ML people now: the hype people and "the
         | skeptics" (interestingly there's a large population of people
         | with physics and math backgrounds here). I put the latter in
         | quotes because we're not trying to stop ML progress. I'd argue
         | we're trying to make it! The argument is we need to recognize
         | flaws so we know what needs to be fixed. This is why Francois
         | Chollet made the claim that GPT has delayed progress towards
         | AGI. Because we are doing the same thing that caused the last
         | AI winter: putting all our eggs in one basket. We've made it
         | hard to pursue other ideas and models because to get published
         | you need to beat benchmarks (good luck doing so out of the gate
         | and without thousands of GPUs). Because we don't look at the
         | limitations in benchmarks. Because we don't even check for God
         | damn information spoilage anymore. Even HumanEval is littered
         | with spoilage, and obviously so...
         | 
         | There are tons of uses for LLMs and ML systems. My "rage" (as
         | with many others) is more about overpromising. Because we know
         | if you don't fulfill those promises quickly, sentiment turns
         | against you and funding quickly goes away. Just look at how
         | even HN went from extremely positive on AI to a similar
         | dichotomy (though the "skeptics" are probably more skeptical
         | than researchers[1]). It's playing with fire. Prometheus gave it
         | to man to enlighten himself, but man also got burned quite
         | frequently.
         | 
         | The answer is:
         | 
         | you evaluate in more detail than others.
         | 
         | [00] of course it is. LLMs replicate average human code.
         | They're optimizers. They optimize fitting data, not fitting
         | optimal symbolic manipulation. If everyone were far better at
         | code, LLMs would be too. That's how they work.
         | 
         | [0] boy, us physicists have big egos but CS people give us a
         | run for our money.
         | 
         | [1] I have no doubt that AGI can be created. I have no doubt we
         | humans can make it. But I highly doubt LLMs will get us there
         | and we need to look in other directions. I'm not saying we
         | should stop pursuing LLMs; I'm saying don't stop the other
         | research from happening. It's not a zero-sum game. Most things
         | in the real world are not (but for some god damn reason we
         | always think they are).
        
         | nzach wrote:
         | > the commits produced
         | 
         | Maybe this is the problem? I quite like using LLMs for coding,
         | but I don't think we are in a position where an LLM is able to
         | create a reasonable commit.
         | 
         | For me using LLMs for coding is like a pair programming session
         | where YOU are the co-pilot. The AI will happily fill your screen
         | with a lot of text, but you have the responsibility to steer
         | the session. Recently I've been using Supermaven in my editor.
         | I like to think of it as 'LSP on steroids': it's not that smart,
         | but it's pretty fast, and for me that's important.
         | 
         | Another way I use LLMs to help me is by asking open-ended
         | questions to a more capable but slower LLM. Something like
         | "What happens when I read a message from a deleted offset in a
         | Kafka topic?" to o1. Most of the time it doesn't give great
         | answers, but it generally gives good keywords to start a more
         | focused Google search.
        
         | lumost wrote:
         | I use it for boilerplate, as an uber-Google, an automated
         | reviewer, a rubber-duck design reviewer, and a junior engineer
         | given extremely precise instructions.
         | 
         | The latest models (o1, the new Claude Sonnet) are decent at
         | generating code for up to 1k lines. Beyond that, they start
         | to struggle. On large codebases they lose the plot quickly and
         | generate gibberish. I only use them as code summarizers and as
         | a Google replacement in that context.
        
         | TacticalCoder wrote:
         | > I still don't understand how people are getting value out of
         | AI coders. I've tried really hard and the commits produced are
         | just a step up from garbage.
         | 
         | These aren't mutually exclusive. I pay for ChatGPT. It sucks
         | fat balls at coding but it's okay at things like _"Bash:
         | ensure exactly two params are passed, 1st one is a dir, 2nd is
         | a file"_. This is slightly faster than writing it myself, so
         | it's worth $20 a month, but that's about it.
         | 
         | "from now on no explanation, code only" also helps.
         | 
         | Does it still suck? Definitely. But it's one more tool. I can
         | understand why one wouldn't even bother, though.
        
           | Lerc wrote:
           | _> "from now on no explanation, code only" also helps._
           | 
           | Does it? Without using a model with an internal monologue
           | interface, the explanation is the only way for the model to
           | do any long-form thinking. I would have thought that
           | requesting an explanation of the code it is about to write
           | would be better. An explanation after it has written the code
           | would be counterproductive, because it would be flavouring
           | the explanation to fit what it actually wrote instead of what
           | it was trying to achieve.
        
       ___________________________________________________________________
       (page generated 2024-12-30 23:00 UTC)