[HN Gopher] LLM and Bug Finding: Insights from a $2M Winning Tea...
       ___________________________________________________________________
        
       LLM and Bug Finding: Insights from a $2M Winning Team in the White
       House's AIxCC
        
       Author : garlic_chives
       Score  : 154 points
       Date   : 2024-08-16 19:56 UTC (1 day ago)
        
 (HTM) web link (team-atlanta.github.io)
        
       | garlic_chives wrote:
       | AIxCC is an AI Cyber Challenge launched by DARPA and ARPA-H.
       | 
       | Notably, a zero-day vulnerability in SQLite3 was discovered and
       | patched during the AIxCC semifinals, demonstrating the potential
       | of LLM-based approaches in bug finding.
        
         | rfoo wrote:
         | Notably, an undiscovered trivial NULL pointer dereference in
         | SQLite3's SQL parser was discovered and patched. But yeah, it
         | makes very good marketing material.
        
           | hqzhao wrote:
           | It's not a critical issue, but it was surprising since we
           | didn't know that SQLite3 would be one of the challenges
           | before the competition.
        
         | hypeatei wrote:
         | Are there any write-ups or CVE pages on that vulnerability? From
         | a quick search, I can't find anything.
        
       | hqzhao wrote:
       | I'm part of the team, and we used LLM agents extensively for
       | smart bug finding and patching. I'm happy to discuss some
       | insights, and share all of the approaches after the grand
       | final :)
        
         | doctorpangloss wrote:
         | Everyone thinks bug bounties should be higher. How high should
         | they be? Who should pay for them?
        
           | hqzhao wrote:
           | It really depends on the target and the quality of the
           | vulnerability. For example, low-quality software on GitHub
           | might not warrant high bug bounties, and that's
           | understandable. However, critical components like KVM, ESXi,
           | WebKit, etc., need to be taken much more seriously.
           | 
           | For vendor-specific software, the responsibility to pay
           | should fall on the vendor. When it comes to open-source
           | software, a foundation funded by the vendors who rely on it
           | for core productivity would be ideal.
           | 
           | For high-quality vulnerabilities, especially those that can
           | demonstrate exploitability without any prerequisites (e.g.,
           | zero-click remote jailbreaks), the bounties should be on par
           | with those offered at competitions like Pwn2Own. :)
        
             | tptacek wrote:
             | Google and Apple bounties on zero-click remotes exceed the
             | prize amounts I see from Pwn2Own?
        
             | doctorpangloss wrote:
             | It seems really hard for people to like, name some
             | vulnerabilities, name some prices. I'm glad you are playing
             | along. Which scenario makes more sense:
             | 
             | The Punchline: Microsoft pays $10m for vulnerabilities like
             | the kind used to exploit SolarWinds and the Azure token
             | audience vulnerability.
             | 
             | The Status Quo: Thousands of people pay CrowdStrike a total
             | of billions of dollars, in exchange for urgent patching when
             | vulnerabilities become known.
             | 
             | Okay, do you see what I am getting at? On the one hand, if
             | you pay bug bounties, the bugs get fixed, and they sure
             | _seem_ expensive. But if you look into how much money is
             | spent on valueless security theatre, it is a total drop in
             | the bucket. But CrowdStrike hires security researchers!
             | 
             | So what should the prices really be? For which
             | vulnerabilities? The SolarWinds issue is probably worth
             | more than $10m, if people are willing to pay 100x more to
             | CrowdStrike for nothing.
        
               | saagarjha wrote:
               | The real question here is who is willing to pay $10
               | million for such a bug.
        
               | tptacek wrote:
               | Nobody. That far exceeds the current market prices of the
               | most in-demand bugs.
        
               | doctorpangloss wrote:
               | What is this market you speak of? Can you link me to it
               | and show me the prices you are talking about? The
               | Microsoft key vulnerability leaked all the State
               | Department emails, and probably a lot more. It could have
               | been used to compromise a lot of Azure. What is
               | comparable?
        
               | necovek wrote:
               | It's not that simple: those billions of dollars are not
               | just for this particular issue, or even just for security
               | support.
               | 
               | It's also a difference between keeping a software
               | engineer on staff and hiring a contractor as needed. One
               | is cheaper for the company even if the hourly rate is
               | higher.
               | 
               | The better question is how we can improve the overall
               | security of the software we write, which this article is
               | more focused on. But we understand that there will be
               | bugs, and security bugs even, no matter how hard we try.
               | 
               | Even DJB (of qmail fame) and Knuth (of TeX and TAOCP
               | fame) pay out bug bounties, and they heavily focus on
               | software correctness over large feature sets.
        
             | logical_person wrote:
             | p2o is pathetically low in comparison to other markets. is
             | your experience limited to legitimate bug bounty programs
             | like that?
        
             | 77pt77 wrote:
             | > KVM, ESXi, WebKit, etc., need to be taken much more
             | seriously.
             | 
             | OpenSSL
        
           | tptacek wrote:
           | Who thinks bug bounties should be higher? Why? Everybody
           | definitely _does not_ think this.
        
             | vasco wrote:
             | There are always two or three people in every thread
             | repeating the same thing without any understanding of
             | marketplace dynamics. If you ask them how much it should be,
             | you also get wild answers that don't reflect reality.
        
         | simonw wrote:
         | What kind of LLM agents did you use?
        
           | hqzhao wrote:
           | Based on popular pre-trained models like GPT-4, Claude
           | Sonnet, and Gemini 1.5, we've built several agents designed
           | to mimic the behaviors and habits of the experts on our team.
           | 
           | Our idea is straightforward: after a decade of auditing code
           | and writing exploits, we've accumulated a wealth of
           | experience. So, why not teach these agents to replicate what
           | we do during bug hunting and exploit writing? Of course, the
           | LLMs themselves aren't sufficient on their own, so we've
           | integrated various program analysis techniques to augment the
           | models and help the agents understand more complex and
           | esoteric code.
        
             | simonw wrote:
             | When you call these things "agents" what do you mean by
             | that? Is this a system prompt combined with some defined
             | tools, or is it a different definition?
        
               | tinco wrote:
               | An agent in this context is software that uses LLM prompt
               | results to determine its next action, often looping to
               | iteratively get to a good result.
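
A minimal sketch of the loop described above, not the team's actual system: the model's reply picks the next action, the result is fed back in, and the loop repeats until the model gives a final answer or a step budget runs out. Everything here is an assumption made for illustration: the `read_file` tool, the `"<tool> <argument>"` reply format, the `FINAL:` marker, and the `scripted_llm` stand-in that replaces a real model call.

```python
from typing import Callable, Dict, List


def read_file(arg: str) -> str:
    """Hypothetical tool: looks up a file in a tiny in-memory 'repo'."""
    files = {"parser.c": "int parse(char *s) { return s[0]; }"}
    return files.get(arg, "<no such file>")


# Hypothetical tool registry; a real agent would expose many more tools.
TOOLS: Dict[str, Callable[[str], str]] = {"read_file": read_file}


def run_agent(llm: Callable[[List[str]], str], task: str, max_steps: int = 5) -> str:
    """Ask the model, act on its reply, feed the observation back, repeat."""
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        reply = llm(history)
        history.append(f"MODEL: {reply}")
        if reply.startswith("FINAL:"):
            # The model decided it is done; return its answer.
            return reply[len("FINAL:"):].strip()
        # Assumed action format: "<tool_name> <argument>"
        name, _, arg = reply.partition(" ")
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        history.append(f"OBSERVATION: {observation}")
    return "gave up: step budget exhausted"


def scripted_llm(history: List[str]) -> str:
    """Stand-in for a real model call: a fixed two-step policy."""
    if not any(h.startswith("OBSERVATION:") for h in history):
        return "read_file parser.c"
    return "FINAL: parse() dereferences s without a NULL check"


result = run_agent(scripted_llm, "audit parser.c")
print(result)
```

Swapping `scripted_llm` for a function that sends `history` to a hosted model would leave the loop unchanged, which matches the "system prompt plus defined tools" reading of the term raised in the parent comment.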
        
             | dogma1138 wrote:
             | Are you going to publish your RAG strategy?
        
         | adragos wrote:
         | Hey, congrats on getting to the finals of AIxCC!
         | 
         | Have you tested your CRS on weekend CTFs? I'm curious how well
         | it'd be able to perform compared to other teams.
        
           | hqzhao wrote:
           | Thanks!
           | 
           | We haven't tested it yet. Regarding CTFs, I have some
           | experience. I'm a member of the Tea Deliverers CTF team, and
           | I participated in the DARPA CGC CTF back in 2016 with team
           | b1o0p.
           | 
           | There are a few issues that make it challenging to directly
           | apply our AIxCC approaches to CTF challenges:
           | 
           | 1. *Format Compatibility:* This year's DEFCON CTF finals
           | didn't follow a uniform format. The challenges were complex
           | and involved formats like a Lua VM running on a custom
           | Verilog simulator. Our system, however, is designed for
           | source code repositories like Git repos.
           | 
           | 2. *Binary vs. Source Code:* CTFs are heavily binary-
           | oriented, whereas AIxCC is focused on source code. In CTFs,
           | reverse engineering binaries is often required, but our
           | system isn't equipped to handle that yet. We are, however,
           | interested in supporting binary analysis in the future!
        
         | wslh wrote:
         | Congrats! ELI5: what insights do you have NOW that were not
         | published/researched extensively in academic papers and/or
         | publicly discussed yet?
        
       | rockskon wrote:
       | The AIxCC booth felt like it was meant for a tradeshow as opposed
       | to being a place where someone could learn something.
        
         | hqzhao wrote:
         | I heard that the AIxCC booth prepared the same challenges for
         | the audience to solve manually, but I didn't check the details.
         | 
         | I believe there will be even more cool stuff in next year's
         | grand final. If you want to get a sense of what to expect,
         | check out the DARPA CGC from 2016. :)
        
           | rockskon wrote:
           | I hope that booth is gone for good. Def Con doesn't need
           | marketers with a blank check putting a booth there. Leave
           | that garbage at Black Hat.
        
             | rockskon wrote:
             | To clarify - I hope your "more cool stuff" doesn't mean
             | more fog machines and LED strips. And some of the companies
             | that seemed to ride DARPA's coattails there made my skin
             | crawl. No slight on DARPA themselves.
        
       | wslh wrote:
       | BTW, have you seen the new LLM-based offensive tools such as XBOW
       | [1]? They just received a funding round from Sequoia Capital
       | [2].
       | 
       | [1] https://xbow.com/
       | 
       | [2] https://www.sequoiacap.com/article/partnering-with-xbow-
       | the-...
        
       | sim7c00 wrote:
       | this is really impressive work. coverage guided and especially
       | directed fuzzing can be extremely difficult. it's mentioned
       | fuzzing is not a dumb technique. I think the classical idea is
       | kind of dumb, in the sense of 'dumb fuzzers' but these days there
       | is tons of intelligence built around it now and poured into it,
       | but i've always thought it's now beyond the classic idea of fuzz
       | testing. i had colleagues who poured their soul into trying to
       | use git commit info etc. to try and help find potentially bad
       | code paths and then coverage guided fuzzing trying to get in
       | there. I really like the little note at the bottom about this.
       | adding such layers kind of does make it lean towards machine
       | learning nowadays, and i'd think perhaps fuzzing is not the
       | right term anymore. i don't think many people are actually still
       | simply generating random inputs and trying to crash programs like
       | that.
       | 
       | this is really exciting new progress around this type of field
       | guys. well done! can't wait to see what new tools and techniques
       | will be yielded from all of this research.
       | 
       | Will you guys be open to implementing something around libafl++
       | perhaps? i remember we worked with that extensively. As a lot of
       | shops use that already it might be cool to look at integration
       | into such tools or would you think this deviates so far it'll
       | amount to a new kind of tool entirely? Also, the work on datasets
       | might be really valuable to other researchers. there was a
       | mention of wasted work but labeled sets of data around cve, bug
       | and patch commits can help a lot of folks if there's new data in
       | there.
       | 
       | this kind of makes me miss having my head in this space :D cool
       | stuff and massive congrats on being finalists. thanks for the
       | extensive writeup!
        
       | deeznuttynutz wrote:
       | What's the good word!!
        
       ___________________________________________________________________
       (page generated 2024-08-17 23:01 UTC)