[HN Gopher] Empirical Study of Test Generation with LLMs
___________________________________________________________________
Empirical Study of Test Generation with LLMs
Author : nickpsecurity
Score : 31 points
Date : 2024-12-28 16:10 UTC (2 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| nickpsecurity wrote:
| Here's the EvoSuite that they refer to:
|
| https://www.evosuite.org/evosuite/
| bediger4000 wrote:
| Doesn't seem to compare to human-generated tests. I'm guessing
| that comparison is none too favorable.
| nickpsecurity wrote:
| I think humans are probably still better (a) when they write
| tests and (b) when they know how to. Neither is true for a lot
| of programmers but might happen if automated cheaply.
|
| I also think traditional tools are better for this. The test
| generation methods include path-based, combinatorial, concolic,
| and adaptive fuzzing. Fire-and-forget tools that do each one,
| suppressing duplicates, would be helpful.
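|
| To give a sense of the combinatorial style, here is a minimal
| Python sketch (the classify_triangle function is made up for
| illustration): the generator just enumerates a small value
| domain and checks a coarse oracle.
|
|     import itertools
|     import unittest
|
|     def classify_triangle(a, b, c):
|         # hypothetical function under test
|         if a + b <= c or a + c <= b or b + c <= a:
|             return "invalid"
|         if a == b == c:
|             return "equilateral"
|         if a == b or b == c or a == c:
|             return "isosceles"
|         return "scalene"
|
|     class CombinatorialTests(unittest.TestCase):
|         def test_all_small_side_combinations(self):
|             values = [1, 2, 3, 10]
|             # enumerate every combination of the small domain
|             for a, b, c in itertools.product(values, repeat=3):
|                 result = classify_triangle(a, b, c)
|                 # coarse oracle: result must be a known category
|                 self.assertIn(result, {"invalid", "equilateral",
|                                        "isosceles", "scalene"})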
|
| What would be easier to train developers on are contracts (or
| annotations). Then, add test generation either from those or
| aided by static analysis that leverages them. Then, software
| that analyzes the code, annotates what it can, and then asks
| the user to fill in the holes. The LLMs could turn their
| answers into formal contracts or specs.
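|
| As a minimal sketch of what I mean (Python; the contract
| decorator and the clamp function are invented for
| illustration), the contract hands whatever generates the tests
| both the legal inputs and the oracle:
|
|     import random
|     import unittest
|
|     def contract(pre, post):
|         # attach pre/postconditions a test generator can read off
|         def wrap(fn):
|             def checked(*args):
|                 assert pre(*args), "precondition violated"
|                 result = fn(*args)
|                 assert post(result, *args), "postcondition violated"
|                 return result
|             checked.pre, checked.post = pre, post
|             return checked
|         return wrap
|
|     @contract(pre=lambda x, lo, hi: lo <= hi,
|               post=lambda r, x, lo, hi: lo <= r <= hi)
|     def clamp(x, lo, hi):
|         return max(lo, min(x, hi))
|
|     class GeneratedFromContract(unittest.TestCase):
|         def test_random_inputs_respect_contract(self):
|             # inputs only need to satisfy the precondition;
|             # the postcondition acts as the oracle
|             for _ in range(200):
|                 lo, hi = sorted(random.sample(range(-100, 100), 2))
|                 clamp(random.randint(-200, 200), lo, hi)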
|
| That's how I see it happening. Maybe add patch generation into
| that, too. At least two tools that predate LLMs already find
| errors and suggest fixes.
| jitl wrote:
| My Cursor workflow for getting tests is to make the test file,
| import the code under test, and then type cmd-k, "unit tests for
| <class>", enter. Add additional cmd-k prompts for method tests
| and cases as needed.
|
| Pretty basic; would adding more shenanigans get me better
| results?
|
| I try to write doc comments for methods with contracts, and
| Cursor/Claude seems to do a good job of reading and testing
| those.
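|
| For a sense of scale, the kind of contract I mean is just a
| couple of lines in the doc comment (the method and wording here
| are invented as an example, not from a real codebase):
|
|     def pop_front(self):
|         """Remove and return the first item.
|
|         Requires: the deque is non-empty.
|         Ensures: len(self) decreases by exactly one and the
|         returned item was previously at index 0.
|         """
|
| The Requires/Ensures lines give the model both the error case
| and the invariant to test.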
| peterldowns wrote:
| What value are the tests giving you? Asking truly naively, I
| haven't tried generating tests for code before, and I'm
| skeptical that it would be useful for me.
| jitl wrote:
| The tests provide the same value as usual: early detection of
| coding mistakes, (somewhat) formal specification of
| contracts, assurance the code does what it's supposed to do
| and continues to do what it's supposed to do as the codebase
| evolves over time.
| arnvald wrote:
| I tried that with Cody (from Sourcegraph), probably using the
| o1 model, and I struggled. I had a long function with a number
| of conditions and early returns.
|
| At first Cody generated a single test case. Then I asked it to
| consider all possible scenarios in that function (around 10);
| it generated 5 cases, 3 of which were correct and 2 of which
| were made up (it tried to use nonexistent enum values).
|
| In the end I used it to generate some mocks and a few minor
| things, but then I copy-pasted the test myself a few times and
| changed the values manually.
|
| Did it save me some time in the end? Possibly, but it also
| caused a lot of frustration. I still hope it gets better,
| because generating tests should be a great use case for LLMs.
| fovc wrote:
| I still don't understand how people are getting value out of AI
| coders. I've tried really hard and the commits produced are just
| a step up from garbage. Writing code from scratch is generally
| decent. But after a few rounds of edits the assistant just starts
| piling conditionals into existing functions until it's a rat's
| nest 4 layers deep and 100+ lines long. The other day it got
| into a loop trying to resolve a type error, where it would make
| a change, then revert it, then make it again.
|
| ETA: Sorry, I forgot about relevancy in my rant! The one area
| where I've found the AIs helpful is enumerating and then
| creating test cases.
| godelski wrote:
| I have a simple answer for you: most people write garbage
| code[00]. I know this because I write garbage code but usually
| less garbage.
|
| A little background...
|
| I got my undergrad in physics (where I fell in love with math),
| spent some years working, and got really interested in coding
| and especially ML (especially the math). So I went to grad
| school. Unsurprisingly, I had major impostor syndrome being
| surrounded by CS people[0], so I spent a huge amount of time
| trying to fill in the gap. The real problem was that many of my
| friends were PL people, so they're math-heavy and great
| programmers. But after teaching a bunch of high level CS
| classes I realized I wasn't behind. After having to fix a lot
| of autograders made by my peers, I didn't feel so behind. When
| my lab grew and I got to meet a lot more ML people, I felt
| ahead, and confused. I realized the problem: I was trying to be
| a physicist in CS. Trying to understand things at very
| fundamental levels and using that to build up, not feeling like
| I "knew" a topic until I knew that chain. I realized people
| were just saying they knew at a different threshold.
|
| Back to ML:
|
| Working and researching in ML I've noticed one common flaw.
| People are ignoring details. I thought HuggingFace would be "my
| savior" where people would see that their generation outputs
| weren't nearly the quality you see in papers. But this didn't
| happen. We cherry-picked results and ignored failures. It feels
| like writing proofs but people only look at the last line (I'd
| argue this is analogous to code! It's about so much more than
| the output! The steps are what matter).
|
| So there are two camps of ML people now: the hype people and "the
| skeptics" (interestingly there's a large population of people
| with physics and math backgrounds here). I put the latter in
| quotes because we're not trying to stop ML progress. I'd argue
| we're trying to make it! The argument is we need to recognize
| flaws so we know what needs to be fixed. This is why Francois
| Chollet made the claim that GPT has delayed progress towards
| AGI. Because we are doing the same thing that caused the last
| AI winter: putting all our eggs in one basket. We've made it
| hard to pursue other ideas and models because to get published
| you need to beat benchmarks (good luck doing so out of the gate
| and without thousands of GPUs). Because we don't look at the
| limitations in benchmarks. Because we don't even check for God
| damn information spoilage anymore. Even HumanEval is littered
| with spoilage, and obviously so...
|
| There's tons of uses for LLMs and ML systems. My "rage" (as
| with many others) is more about over-promising. Because we know
| if you don't fulfill those promises quickly, sentiment turns
| against you and funding quickly goes away. Just look at how
| even HN went from extremely positive on AI to a similar
| dichotomy (though the "skeptics" are probably more skeptical
| than researchers [1]). It's playing with fire. Prometheus gave
| it to man to enlighten themselves, but they also burned
| themselves quite frequently.
|
| The answer is:
|
| you evaluate in more detail than others.
|
| [00] Of course it is: LLMs replicate average human code.
| They're optimizers. They optimize for fitting the data, not for
| optimal symbolic manipulation. If everyone were far better at
| code, LLMs would be too. That's how they work.
|
| [0] Boy, us physicists have big egos, but CS people give us a
| run for our money.
|
| [1] I have no doubt that AGI can be created. I have no doubt we
| humans can make it. But I highly doubt LLMs will get us there
| and we need to look in other directions. I'm not saying we
| should stop pursuing LLMs, I'm saying don't stop the other
| research from happening. It's not a zero-sum game. Most things
| in the real world are not (but for some god damn reason we
| always think they are).
| nzach wrote:
| > the commits produced
|
| Maybe this is the problem? I quite like using LLMs for coding,
| but I don't think we are at a point where an LLM is able to
| create a reasonable commit.
|
| For me using LLMs for coding is like a pair programming session
| where YOU are the co-pilot. The AI will happily fill your
| screen with a lot of text, but you have the responsibility to
| steer the session. Recently I've been using Supermaven in my
| editor. I like to think of it as 'LSP on steroids': it's not
| that smart, but it's pretty fast, and for me that's important.
|
| Another way I use LLMs to help me is by asking open-ended
| questions to a more capable but slower LLM. Something like
| "What happens when I read a message from a deleted offset in a
| Kafka topic?" to o1. Most of the time it doesn't give great
| answers, but it generally gives good keywords to start a more
| focused Google search.
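|
| (For what it's worth, the Kafka behaviour in that question
| mostly comes down to the consumer's auto.offset.reset setting.
| A rough kafka-python sketch, with the topic and broker names
| made up:)
|
|     from kafka import KafkaConsumer
|
|     # If the requested offset has been deleted (e.g. by retention),
|     # the broker reports the offset as out of range and this
|     # setting decides what happens next: jump to the earliest
|     # surviving offset, jump to the latest, or raise an error.
|     consumer = KafkaConsumer(
|         "orders",                           # hypothetical topic
|         bootstrap_servers="localhost:9092",
|         group_id="example-group",
|         auto_offset_reset="earliest",       # or "latest"
|     )
|     for message in consumer:
|         print(message.offset, message.value)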
| lumost wrote:
| I use it for the boilerplate, uber-Google, automated reviewer,
| rubber-duck design reviewer, and junior engineer given
| extremely precise instructions.
|
| The latest models (o1, the new Claude Sonnet) are decent at
| generating code up to about 1k lines. More than that and they
| start to struggle. On large codebases they lose the plot quickly
| and generate gibberish. I only use them as code summarizers and
| as a Google replacement in that context.
| TacticalCoder wrote:
| > I still don't understand how people are getting value out of
| AI coders. I've tried really hard and the commits produced are
| just a step up from garbage.
|
| These aren't mutually exclusive. I pay for ChatGPT. It sucks
| fat balls at coding, but it's okay to do things like _"Bash:
| ensure exactly two params are passed, 1st one is a dir, 2nd is
| a file"_. This is slightly faster than writing it myself, so
| it's worth $20 a month but that's about it.
|
| "from now on no explanation, code only" also helps.
|
| Does it still suck? Definitely. But it's one more tool. I can
| understand why one wouldn't even bother though.
| Lerc wrote:
| _> "from now on no explanation, code only" also helps._
|
| Does it? Without using a model with an internal monologue
| interface, the explanation is the only way for the model to
| do any long form thinking. I would have thought that
| requesting an explanation of the code it is about to write
| would be better. An explanation after it has written the code
| would be counterproductive, because it would be flavouring the
| explanation to match what it actually wrote instead of what it
| was trying to achieve.
___________________________________________________________________
(page generated 2024-12-30 23:00 UTC)