[HN Gopher] Using LLMs to Generate Fuzzers
       ___________________________________________________________________
        
       Using LLMs to Generate Fuzzers
        
       Author : moyix
       Score  : 123 points
       Date   : 2024-03-09 17:40 UTC (1 day ago)
        
 (HTM) web link (verse.systems)
 (TXT) w3m dump (verse.systems)
        
       | popinman322 wrote:
       | You could likely also combine the LLM with a coverage tool to
       | provide additional guidance when regenerating the fuzzer: "Your
       | fuzzer missed lines XX-YY in the code. Explain why you think the
       | fuzzer missed those lines, describe inputs that might reach those
       | lines in the code, and then update the fuzzer code to match your
       | observations."
       | 
       | This approach could likely also be combined with RL; the code
       | coverage provides a decent reward signal.
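       | 
       | A rough sketch of what that loop might look like, in Python
       | (query_llm, run_with_coverage, and parser_src are hypothetical
       | placeholders for an LLM API call, a coverage-instrumented run,
       | and the parser source text; none of this is from the post):
       | 
       |     # Regenerate the fuzzer a few times, feeding back the lines
       |     # that were not covered on the previous attempt.
       |     fuzzer_src = query_llm(
       |         "Write a fuzzer for this parser:\n" + parser_src)
       |     for _ in range(5):
       |         missed = run_with_coverage(fuzzer_src)  # e.g. "120-140"
       |         if not missed:
       |             break
       |         fuzzer_src = query_llm(
       |             f"Your fuzzer missed lines {missed}. Explain why, "
       |             "describe inputs that might reach those lines, and "
       |             "then update the fuzzer code.\n" + fuzzer_src)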
        
         | wrsh07 wrote:
         | To me, if it detects bugs (and fixing those makes the others
         | reachable), that seems like a pretty acceptable iterative step
         | 
         | It's less academically pure, but as an engineer who wants to
         | fix bugs it seems ok
        
       | smusamashah wrote:
       | I have a bit of a pet peeve with people testing LLMs like this
       | these days.
       | 
       | They take whatever it spits out on the first attempt and then
       | extrapolate from that to draw all kinds of conclusions. They
       | forget that the output is based on a random seed; a new attempt
       | (with a new seed) is going to give a totally different answer.
       | 
       | If the author had retried that prompt, the new attempt might
       | have generated better code or much worse code. You cannot draw
       | conclusions from just one answer.
        
         | Vetch wrote:
         | That doesn't seem to be the case here. Reading through the
         | article and twitter thread, the impression I get is that
         | between moyix and the author, a decent amount of time was spent
         | on this. A valid criticism that could have been made is the use
         | of Claude Sonnet, but based on the twitter thread, it looks like
         | Opus was what @moyix used.
        
           | moyix wrote:
           | Yes - it's a bit hard to follow the various branches of the
           | thread on Twitter (I wasn't really intending this to be more
           | than a 30 minute "hey that's neat" kind of experiment, but
           | people kept suggesting new and interesting things to try :)),
           | but I gave Claude Opus three independent tries at creating a
           | fuzzer for Toby's packet parser, and it consistently missed
           | the fact that it needed to include the sequence number in the
           | CRC calculation.
           | 
           | Once that oversight was pointed out, it did write a decent
           | fuzzer that found the memory safety bugs in do_read and
           | do_write. I also got it to fix those two bugs automatically
           | (by providing it the ASAN output).
        
         | refulgentis wrote:
         | > A new attempt (with a new seed) is going to give a totally
         | different answer
         | 
         | Totally different...I'd posit 5% different, and mostly in
         | trivialities.
         | 
         | It's worth doing an experiment and prompting an LLM with a
         | coding question twice, then seeing how different it is.
         | 
         | For, say, a K-Means clustering algorithm, you're absolutely
         | correct. The initial state is _completely_ dependent on the
         | choice of seed.
         | 
         | With LLMs, the initial state is your prompt + a seed. The
         | prompt _massively_ overwhelms the seed. Then, the nature of the
         | model, predicting probabilities, then the nature of sampling,
         | attempting to minimize surprise, means there's a powerful
         | forcing function towards answers that share much in common.
         | This is both in theory, and I think you'll see, in practice.
        
           | smusamashah wrote:
           | Depends on the question. If you asked for a small fact, you
           | are going to get almost the same answer every time. But if
           | it's not a factual question, and the answer is supposed to be
           | a long, tangled one, then it is going to depend on what the
           | LLM said in the first lines, because it will stick with that.
           | 
           | E.g. the LLM might have said, for some reason, that writing a
           | fuzzer like this isn't possible, and then gone on to present
           | some alternatives for the given task.
           | 
           | I only have experience with GPT-4 via the API, but I believe
           | at their core all these LLMs work the same way.
        
             | refulgentis wrote:
             | You're absolutely correct, in that it's never guaranteed
             | what the next token is.
             | 
             | My pushback is limited to this: the theoretical maximal
             | degenerate behavior described in either of your comments is
             | highly improbable in practice, with a lot of givens, such
             | as reasonable parameters and a reasonable model.
             | 
             | I.e. it will not
             | 
             | - give totally different answers just because the seed
             | changed.
             | 
             | - say it is impossible X% of the time (for any X > 5) and
             | provide some solution the other (100 - X)% of the time.
             | 
             | I have integrated with GPT-3.0/GPT-3.5/GPT-4 and revisions
             | thereof via API, as well as Claude 2 and, this week, Claude
             | 3. I wrote a native inference solution that runs, among
             | others, StableLM Zephyr 3B, Mistral 7B, and Mixtral 8x7B,
             | and I wrote code that does inference, step by excruciating
             | step, in a loop, on the web via WASM, and, via C++,
             | tailored solutions for Android, iOS, macOS, and Windows.
        
               | smusamashah wrote:
               | I still think it depends on the subject you are
               | prompting about. If the LLM knows that thing very well it
               | will stick to the answer; otherwise it can go in a
               | different direction, depending on how its initial
               | assessment differed.
               | 
               | Yesterday I asked it to write a simple VB script to show
               | a reminder that I would schedule via the command line
               | using the Windows Task Scheduler. On the first attempt it
               | suggested creating a separate VB file for each message,
               | based on its initial reasoning that arguments can't be
               | passed to a VB file like that. That didn't seem correct
               | (confirmed via Google), so I resubmitted the same prompt,
               | and this time it said I could simply pass my reminder
               | message as an argument to the VB script, and the next
               | code was based on that. (I don't know VB and had never
               | used Task Scheduler before.)
               | 
               | This was GPT-4. You are not wrong about 'maximal
               | degenerate behaviour', but an initially generated
               | assumption can lead to a different answer overall.
               | Chain-of-thought prompting stems from this exact
               | behaviour.
        
       | planetis wrote:
       | This seems to overlook that the language model was trained on a
       | large corpus of code, which probably includes structured fuzzers
       | for file formats such as GIF. Plus, the scope of the "unknown"
       | format introduced is limited.
        
         | moyix wrote:
         | The original test of the GIF parser does, but the VRML parser
         | less so and the completely novel packet parser even less so.
         | I'm not quite sure what you mean by the scope of the "unknown"
         | format being limited - it's not the most complex format in the
         | world, but neither is GIF.
         | 
         | Another test to check how much seeing the actual parser code
         | helps is to have it generate a GIF fuzzer _without_ giving it
         | the code:
         | 
         | https://twitter.com/moyix/status/1766135426476064774
         | 
         | And finally, for fun, we can see how it does when we give it
         | the RFC for GIF89a:
         | 
         | https://twitter.com/moyix/status/1766207786751279298
        
       | ttul wrote:
       | I read a lot of niggling comments here about whether Claude was
       | really being smart in writing this GIF fuzzer. Of course it was
       | trained on fuzzer source code. Of course it has read every blog
       | post about esoteric boundary conditions in GIF parsers.
       | 
       | But to bring all of those things together and translate the
       | concepts into working Python code is astonishing. We have just
       | forgotten that a year ago, this achievement would have blown our
       | minds.
       | 
       | I recently had to write an email to my kid's school so that he
       | could get some more support for a learning disability. I fed
       | Claude 3 Opus a copy of his 35 page psychometric testing report
       | along with a couple of his recent report cards and asked it to
       | draft the email for me, making reference to things in the three
       | documents provided. I also suggested it pay special attention to
       | one of the testing results.
       | 
       | The first email draft was ready to send. Sure, I tweaked a thing
       | or two, but this saved me half an hour of digging through dense
       | material written by a psychologist. After verifying that there
       | were no factual errors, I hit "Send." To me, it's still magic.
        
         | asymmetric wrote:
         | Were you not concerned about the privacy implications of
         | uploading your child's sensitive health data to a private LLM?
        
           | michaelbuckbee wrote:
           | Not OP, but parent of multiple school-age kids and both:
           | 
           | 1. You're 100% right, there are privacy concerns.
           | 
           | 2. I don't know if they could possibly be worse than those
           | of the majority of school districts (including my kids'),
           | which run directly off of Google's Education system
           | (Chromebooks, Google Docs, Gmail, etc.).
        
             | flemhans wrote:
             | Can you opt out? Are there privacy-friendly schools?
             | 
             | Could you enroll your child under a fake name? How messed
             | up would they think that is :D
        
           | kelseyfrog wrote:
           | It's important to differentiate concern (a feeling) from the
           | choice of whether to upload. In the calculus of benefits and
           | risks, the feeling of concern (about potentially leaking
           | PII/health information) may be outweighed by the benefit to
           | the child's education. Even if someone is concerned, they
           | may still see the positives outweighing the risks. It's a
           | subjective decision at the end of the day.
        
       | aaron695 wrote:
       | I don't understand why we are treating LLMs generating _code_ to
       | create fuzzing data as a 'thing'.
       | 
       | Logically, LLMs should be quite good at creating the fuzzing data
       | themselves.
       | 
       | To state the obvious reason why: it's too expensive to use LLMs
       | directly, and this way works, since they found "4 memory safety
       | bugs and one hang".
       | 
       | But the future we are heading toward should be LLMs directly
       | pentesting/testing the code. That is where it gets interesting
       | and new.
        
         | moyix wrote:
         | I don't think using a language model to generate inputs
         | directly is ever going to be as efficient as writing a little
         | bit of code to do the generation; it's _really_ hard to beat an
         | input generator that can craft thousands of inputs/second.
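         | 
         | For a sense of scale, even a naive generator along these lines
         | (just an illustration in Python, not the fuzzer from the post)
         | can emit thousands of inputs per second on one core:
         | 
         |     import os, random, struct
         | 
         |     def random_gif_like():
         |         # Plausible-looking header plus random bytes; a toy
         |         # stand-in for a generated structured fuzzer.
         |         header = random.choice([b"GIF89a", b"GIF87a"])
         |         dims = struct.pack("<HH", random.randrange(65536),
         |                            random.randrange(65536))
         |         body = os.urandom(random.randrange(256))
         |         return header + dims + body
         | 
         |     inputs = [random_gif_like() for _ in range(10_000)]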
        
         | yencabulator wrote:
         | For one, it'd be really hard for an LLM to get the CRC32 right,
         | especially when it's in a header before the data it covers.
         | 
         | Then again, this whole approach to fuzzing comes across as
         | kinda naive; at the very least you'd want to use the API of a
         | coverage-guided fuzzer to generate the randomness (and then
         | almost always fix up the CRC32 on top of that, like a human-
         | written wrapper function would).
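         | 
         | The fix-up wrapper is usually only a few lines. A minimal
         | Python sketch, assuming a layout where a 4-byte little-endian
         | CRC32 field sits at offset 4 and covers everything after it
         | (the real offsets in Toby's format may differ):
         | 
         |     import struct, zlib
         | 
         |     CRC_OFFSET = 4  # assumed position of the CRC32 field
         | 
         |     def fix_crc(packet: bytes) -> bytes:
         |         # Recompute the CRC32 over the covered region and
         |         # splice it back into the header.
         |         covered = packet[CRC_OFFSET + 4:]
         |         crc = zlib.crc32(covered) & 0xFFFFFFFF
         |         return (packet[:CRC_OFFSET]
         |                 + struct.pack("<I", crc)
         |                 + covered)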
        
           | dmazzoni wrote:
           | Exactly. If I actually wanted to fuzz this I'd use libfuzzer
           | and manually fix the crc32. An LLM would be useful in helping
           | me write the libfuzzer glue code.
        
       | dmazzoni wrote:
       | Why wouldn't you have an LLM write some code that uses something
       | like libfuzzer instead?
       | 
       | That way you get an efficient, robust coverage-driven fuzzing
       | engine, rather than having an LLM try to reinvent the wheel on
       | that part of the code poorly. Let the LLM help write the
       | boilerplate code for you.
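       | 
       | For a target with Python bindings, that glue is tiny. A sketch
       | using Atheris (Google's coverage-guided fuzzer for Python, built
       | on libFuzzer); myparser and parse_gif are hypothetical stand-ins
       | for the code under test, and a C target would use an analogous
       | LLVMFuzzerTestOneInput harness:
       | 
       |     import sys
       |     import atheris
       | 
       |     with atheris.instrument_imports():
       |         from myparser import parse_gif  # hypothetical target
       | 
       |     def TestOneInput(data: bytes):
       |         try:
       |             parse_gif(data)
       |         except ValueError:
       |             pass  # expected parse errors are not bugs
       | 
       |     atheris.Setup(sys.argv, TestOneInput)
       |     atheris.Fuzz()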
        
         | moyix wrote:
         | They're actually orthogonal approaches - from what I've seen so
         | far the LLM fuzzer generates much higher quality seeds than
         | you'd get even after fuzzing for a while (in the case of the
         | VRML target, even _if_ you start with some valid test files
         | found online), but it's not as good at generating broken
         | inputs. So the obvious thing to do is have the LLM's fuzzer
         | generate initial seeds that get pretty good coverage and then
         | let a traditional coverage-guided fuzzer further mutate those.
         | 
         | These are still pretty small-scale experiments on essentially
         | toy programs, so it remains to be seen whether LLMs stay
         | useful on real-world programs, but so far it looks pretty
         | promising - and it's a lot less work than writing a new
         | libfuzzer target,
         | especially when the program is one that's not set up with nice
         | in-memory APIs (e.g., that GIF decoder program just uses read()
         | calls distributed all over the program; it would be fairly
         | painful to refactor it to play nicely with libfuzzer).
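         | 
         | Concretely, the handoff can be as simple as dumping the LLM-
         | generated inputs into a corpus directory and pointing a
         | coverage-guided fuzzer at it (generate_input stands in for
         | whatever generator function the LLM produced; the llm_fuzzer
         | module name is made up):
         | 
         |     import pathlib
         | 
         |     from llm_fuzzer import generate_input  # hypothetical
         | 
         |     corpus = pathlib.Path("corpus")
         |     corpus.mkdir(exist_ok=True)
         |     for i in range(1000):
         |         path = corpus / f"seed_{i:04d}"
         |         path.write_bytes(generate_input())
         | 
         |     # Then e.g.: afl-fuzz -i corpus -o findings -- ./target @@
         |     # or hand the corpus directory to a libFuzzer binary.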
        
       ___________________________________________________________________
       (page generated 2024-03-10 23:01 UTC)