[HN Gopher] Using LLMs to Generate Fuzzers
___________________________________________________________________
Using LLMs to Generate Fuzzers
Author : moyix
Score : 123 points
Date : 2024-03-09 17:40 UTC (1 day ago)
(HTM) web link (verse.systems)
(TXT) w3m dump (verse.systems)
| popinman322 wrote:
| You could likely also combine the LLM with a coverage tool to
| provide additional guidance when regenerating the fuzzer: "Your
| fuzzer missed lines XX-YY in the code. Explain why you think the
| fuzzer missed those lines, describe inputs that might reach those
| lines in the code, and then update the fuzzer code to match your
| observations."
|
| This approach could likely also be combined with RL; the code
| coverage provides a decent reward signal.
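|
| A rough sketch of what that regeneration loop might look like
| (the coverage helper is a hypothetical placeholder, and the
| OpenAI-style chat API is just an assumption for illustration):
|
|       from openai import OpenAI
|
|       client = OpenAI()
|
|       def missed_lines(fuzzer_path: str) -> list[int]:
|           # Placeholder: run the generated fuzzer against the
|           # target and return the source lines it never reached
|           # (e.g. by parsing gcov/llvm-cov output).
|           raise NotImplementedError
|
|       def regenerate(fuzzer_code: str, missed: list[int]) -> str:
|           prompt = (
|               f"Your fuzzer missed lines {missed} in the code. "
|               "Explain why you think the fuzzer missed those "
|               "lines, describe inputs that might reach those "
|               "lines in the code, and then update the fuzzer "
|               "code to match your observations.\n\n"
|               f"Current fuzzer:\n{fuzzer_code}"
|           )
|           resp = client.chat.completions.create(
|               model="gpt-4",
|               messages=[{"role": "user", "content": prompt}],
|           )
|           return resp.choices[0].message.content
|
|       fuzzer = open("fuzzer.py").read()
|       for _ in range(5):  # a few refinement rounds
|           missed = missed_lines("fuzzer.py")
|           if not missed:
|               break
|           fuzzer = regenerate(fuzzer, missed)
|           open("fuzzer.py", "w").write(fuzzer)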
| wrsh07 wrote:
| To me, if it detects bugs (and fixing those makes the others
| reachable), that seems like a pretty acceptable iterative step
|
| It's less academically pure, but as an engineer who wants to
| fix bugs it seems ok
| smusamashah wrote:
| I have kind of a pet peeve with people testing LLMs like this
| these days.
|
| They take whatever it spits out in the first attempt, and then
| they go on to extrapolate this to draw all kinds of conclusions.
| They forget the output it generated is based on a random seed. A
| new attempt (with a new seed) is going to give a totally
| different answer.
|
| If the author had retried that prompt, the new attempt might
| have generated better code or might have generated a lot worse
| code. You cannot draw conclusions from just one answer.
| Vetch wrote:
| That doesn't seem to be the case here. Reading through the
| article and twitter thread, the impression I get is that
| between moyix and the author, a decent amount of time was spent
| on this. A valid criticism that could have been made is the use
| of Claude Sonnet, but based on the Twitter thread, it looks like
| Opus was what @moyix leveraged.
| moyix wrote:
| Yes - it's a bit hard to follow the various branches of the
| thread on Twitter (I wasn't really intending this to be more
| than a 30 minute "hey that's neat" kind of experiment, but
| people kept suggesting new and interesting things to try :)),
| but I gave Claude Opus three independent tries at creating a
| fuzzer for Toby's packet parser, and it consistently missed
| the fact that it needed to include the sequence number in the
| CRC calculation.
|
| Once that oversight was pointed out, it did write a decent
| fuzzer that found the memory safety bugs in do_read and
| do_write. I also got it to fix those two bugs automatically
| (by providing it the ASAN output).
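|
| For the curious, the kind of thing it kept missing, sketched on
| a made-up packet layout (this is not Toby's actual format, just
| an illustration of the idea): the CRC has to be computed over
| the sequence number as well as the payload, not the payload
| alone.
|
|       import struct
|       import zlib
|
|       def make_packet(seq: int, payload: bytes) -> bytes:
|           body = struct.pack("<I", seq) + payload
|           # The checksum covers the sequence number too.
|           crc = zlib.crc32(body) & 0xFFFFFFFF
|           return struct.pack("<I", crc) + body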
| refulgentis wrote:
| > A new attempt (with a new seed) is going to give a totally
| different answer
|
| Totally different...I'd posit 5% different, and mostly in
| trivialities.
|
| It's worth doing an experiment and prompting an LLM with a
| coding question twice, then seeing how different it is.
|
| For, say, a K-Means clustering algorithm, you're absolutely
| correct. The initial state is _completely_ dependent on the
| choice of seed.
|
| With LLMs, the initial state is your prompt + a seed. The
| prompt _massively_ overwhelms the seed. Then the nature of the
| model (predicting probabilities) and the nature of sampling
| (attempting to minimize surprise) mean there's a powerful
| forcing function towards answers that share much in common.
| This is true both in theory and, I think you'll see, in
| practice.
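|
| A minimal way to run that experiment yourself (a sketch that
| assumes the OpenAI Python client; the model name and prompt are
| just examples):
|
|       from openai import OpenAI
|
|       client = OpenAI()
|
|       def ask(seed: int) -> str:
|           resp = client.chat.completions.create(
|               model="gpt-4",
|               seed=seed,  # only the seed differs between runs
|               messages=[{
|                   "role": "user",
|                   "content": "Write a Python GIF-header parser.",
|               }],
|           )
|           return resp.choices[0].message.content
|
|       a, b = ask(1), ask(2)
|       # Diff a and b: in practice the two answers tend to share
|       # most of their structure.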
| smusamashah wrote:
| Depends on the question. If you asked for a small fact, you
| are going to get almost the same answer every time. But if
| it's not a factual question, and the answer is supposed to be
| a long, tangled one, then the answer is going to depend on
| what the LLM said in the first lines, because it is going to
| stick with that.
|
| e.g. the LLM might have said for some reason that writing a
| fuzzer like this isn't possible and then gone on to present
| some alternatives for the given task.
|
| I only have experience with GPT-4 via the API, but I believe
| at their core all these LLMs work the same way.
| refulgentis wrote:
| You're absolutely correct, in that it's never guaranteed
| what the next token is.
|
| My pushback is limited to this: the theoretical maximal
| degenerate behavior described in either of your comments is
| highly improbable in practice, given a few reasonable
| assumptions, such as reasonable parameters and a reasonable
| model.
|
| I.e. it will not
|
| - give totally different answers just because the seed
| changed.
|
| - say it is impossible X% of the time (where X > 5) and
| provide some solution the other (100 - X)% of the time.
|
| I have integrated with GPT-3.0/GPT-3.5/GPT-4 and revisions
| thereof via API, as well as Claude 2 and, this week, Claude
| 3. I wrote a native inference solution that runs, among
| others, StableLM Zephyr 3B, Mistral 7B, and Mixtral 8x7B,
| and I wrote code that does inference, step by excruciating
| step, in a loop, on the web via WASM, and, via C++, tailored
| solutions for Android, iOS, macOS, and Windows.
| smusamashah wrote:
| I still think it depends on the subject you are
| prompting. If the LLM knows that thing very well it will
| stick to the answer; otherwise it can go in a different
| direction based on how different the initial assessment
| was.
|
| Yesterday I asked it to write a simple VB script to show
| a reminder that I would schedule via the command line
| using Windows Task Scheduler. In the first attempt it
| suggested creating a VB file for each message, based on
| its initial reasoning that I could not pass arguments to
| the VB file like that. That didn't seem correct (confirmed
| via Google), so I resubmitted the same prompt, but this
| time it said that I could simply pass my reminder message
| as an argument to the VB script, and the code that
| followed was based on that. (I don't know VB and had never
| used Task Scheduler before.)
|
| This was GPT-4. You are not wrong about 'maximal
| degenerate behaviour', but an initially generated
| assumption can lead to different answers overall.
| Chain-of-thought prompting stems from this exact
| behaviour.
| planetis wrote:
| It seems to overlook that the language model was developed using
| a large corpus of code, which probably includes structured
| fuzzers for file formats such as GIF. Plus, the scope of the
| "unknown" format introduced is limited.
| moyix wrote:
| The original test of the GIF parser does, but the VRML parser
| less so and the completely novel packet parser even less so.
| I'm not quite sure what you mean by the scope of the "unknown"
| format being limited - it's not the most complex format in the
| world, but neither is GIF.
|
| Another test to check how much seeing the actual parser code
| helps is to have it generate a GIF fuzzer _without_ giving it
| the code:
|
| https://twitter.com/moyix/status/1766135426476064774
|
| And finally, for fun, we can see how it does when we give it
| the RFC for GIF89a:
|
| https://twitter.com/moyix/status/1766207786751279298
| ttul wrote:
| I read a lot of niggling comments here about whether Claude was
| really being smart in writing this GIF fuzzer. Of course it was
| trained on fuzzer source code. Of course it has read every blog
| post about esoteric boundary conditions in GIF parsers.
|
| But to bring all of those things together and translate the
| concepts into working Python code is astonishing. We have just
| forgotten that a year ago, this achievement would have blown our
| minds.
|
| I recently had to write an email to my kid's school so that he
| could get some more support for a learning disability. I fed
| Claude 3 Opus a copy of his 35 page psychometric testing report
| along with a couple of his recent report cards and asked it to
| draft the email for me, making reference to things in the three
| documents provided. I also suggested it pay special attention to
| one of the testing results.
|
| The first email draft was ready to send. Sure, I tweaked a thing
| or two, but this saved me half an hour of digging through dense
| material written by a psychologist. After verifying that there
| were no factual errors, I hit "Send." To me, it's still magic.
| asymmetric wrote:
| Were you not concerned about the privacy implications of
| uploading your child's sensitive health data to a private LLM?
| michaelbuckbee wrote:
| Not OP, but a parent of multiple school-age kids, and both of
| the following apply:
|
| 1. You're 100% right, there are privacy concerns.
|
| 2. I don't know if they could possibly be worse than the
| majority of school districts (including my kids') running
| directly off of Google's Education system (Chromebooks,
| Google Docs, Gmail, etc.).
| flemhans wrote:
| Can you opt out? Are there privacy-friendly schools?
|
| Could you enroll your child under a fake name? How messed
| up would they think that is :D
| kelseyfrog wrote:
| It's important to differentiate concern (a feeling) from
| choosing to upload or not. In the calculus of benefits and
| risks, the feeling of concern (about potentially leaking
| PII/health information) may be outweighed by the benefit in
| education. Even if someone is concerned, they may still see
| the positives as outweighing the risks. It's a subjective
| decision at the end of the day.
| aaron695 wrote:
| I don't understand why we are getting LLMs to generate _code_ to
| create fuzzing data as a 'thing'
|
| Logically LLMs should be quite good at creating the fuzzing data.
|
| To state the obvious reason why: it's too expensive to use LLMs
| directly, and this way works, since they found "4 memory safety
| bugs and one hang".
|
| But the future we are heading towards should be LLMs directly
| pentesting/testing the code. That is where it gets interesting
| and new.
| moyix wrote:
| I don't think using a language model to generate inputs
| directly is ever going to be as efficient as writing a little
| bit of code to do the generation; it's _really_ hard to beat an
| input generator that can craft thousands of inputs/second.
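|
| For a sense of scale, even a throwaway generator along these
| lines (just a sketch in the spirit of the article's Python
| fuzzers, with a hypothetical ./parser binary as the target)
| churns through cases far faster than an LLM emitting inputs
| token by token ever could:
|
|       import random
|       import subprocess
|
|       def random_gif() -> bytes:
|           # Valid-ish header plus random junk for the body.
|           header = b"GIF89a"
|           body = bytes(random.getrandbits(8)
|                        for _ in range(random.randint(0, 512)))
|           return header + body
|
|       for i in range(100_000):
|           data = random_gif()
|           with open("input.gif", "wb") as f:
|               f.write(data)
|           p = subprocess.run(["./parser", "input.gif"],
|                              capture_output=True)
|           if p.returncode < 0:  # killed by a signal -> crash
|               open(f"crash_{i}.gif", "wb").write(data)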
| yencabulator wrote:
| For one, it'd be really hard for an LLM to get the CRC32 right,
| especially when it's in a header before the data it covers.
|
| Then again, this whole approach to fuzzing comes across as
| kinda naive; at the very least you'd want to use the API of a
| coverage-guided fuzzer for generating the randomness (and then
| almost always fix up the CRC32 on top of that, like a human-
| written wrapper function would).
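|
| A sketch of that shape using Atheris (Google's libFuzzer
| bindings for Python) as the source of randomness, with a
| stand-in parse_packet() target and a header-first CRC32 layout
| (the checksum sits before the data it covers):
|
|       import sys
|       import zlib
|       import atheris
|
|       def fix_crc(data: bytes) -> bytes:
|           if len(data) < 4:
|               return data
|           # Recompute the leading CRC32 over everything after
|           # it, so mutated inputs still pass the checksum.
|           crc = zlib.crc32(data[4:]) & 0xFFFFFFFF
|           return crc.to_bytes(4, "little") + data[4:]
|
|       def parse_packet(packet: bytes) -> None:
|           ...  # stand-in for the real parser under test
|
|       def TestOneInput(data: bytes) -> None:
|           parse_packet(fix_crc(data))
|
|       atheris.instrument_all()
|       atheris.Setup(sys.argv, TestOneInput)
|       atheris.Fuzz()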
| dmazzoni wrote:
| Exactly. If I actually wanted to fuzz this I'd use libfuzzer
| and manually fix the crc32. An LLM would be useful in helping
| me write the libfuzzer glue code.
| dmazzoni wrote:
| Why wouldn't you have an LLM write some code that uses something
| like libfuzzer instead?
|
| That way you get an efficient, robust coverage-driven fuzzing
| engine, rather than having an LLM try to reinvent the wheel on
| that part of the code poorly. Let the LLM help write the
| boilerplate code for you.
| moyix wrote:
| They're actually orthogonal approaches - from what I've seen so
| far the LLM fuzzer generates much higher-quality seeds than
| you'd get even after fuzzing for a while (in the case of the
| VRML target, even _if_ you start with some valid test files
| found online), but it's not as good at generating broken
| inputs. So the obvious thing to do is have the LLM's fuzzer
| generate initial seeds that get pretty good coverage and then
| use a traditional coverage-guided fuzzer to further mutate
| those.
|
| These are still pretty small-scale experiments on essentially
| toy programs, so it remains to be seen whether LLMs stay useful
| on real-world programs, but so far it looks pretty promising -
| and it's a lot less work than writing a new libfuzzer target,
| especially when the program is one that's not set up with nice
| in-memory APIs (e.g., that GIF decoder program just uses read()
| calls distributed all over the program; it would be fairly
| painful to refactor it to play nicely with libfuzzer).
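|
| Concretely, the combination is just: have the LLM-written
| generator dump a pile of seeds into a corpus directory, then
| point an ordinary coverage-guided fuzzer at that corpus
| (sketch; generate_input() stands in for whatever the LLM
| wrote, and the afl-fuzz line is just the usual invocation):
|
|       import os
|       from fuzzer import generate_input  # the LLM-written part
|
|       os.makedirs("corpus", exist_ok=True)
|       for i in range(1000):
|           with open(f"corpus/seed_{i:04d}", "wb") as f:
|               f.write(generate_input())
|
|       # Then, for example:
|       #   afl-fuzz -i corpus -o findings -- ./gif_parser @@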
___________________________________________________________________
(page generated 2024-03-10 23:01 UTC)