[HN Gopher] XBOW, an autonomous penetration tester, has reached ...
       ___________________________________________________________________
        
       XBOW, an autonomous penetration tester, has reached the top spot on
       HackerOne
        
       Author : summarity
       Score  : 122 points
       Date   : 2025-06-24 15:53 UTC (7 hours ago)
        
 (HTM) web link (xbow.com)
 (TXT) w3m dump (xbow.com)
        
       | ikmckenz wrote:
       | Related: https://arstechnica.com/gadgets/2025/05/open-source-
       | project-...
        
         | moyix wrote:
         | The main difference is that all of the vulnerabilities reported
         | here are real, many quite critical (XXE, RCE, SQLi, etc.). To
         | be fair there were definitely a lot of XSS, but the main reason
         | for that is that it's a really common vulnerability.
        
       | andrewstuart wrote:
       | All the fun vanishes.
        
         | tptacek wrote:
         | Good. It was in the way.
        
           | kiitos wrote:
           | In the way of what?
        
             | tptacek wrote:
             | Getting more bugs fixed.
        
               | kiitos wrote:
               | > Getting more bugs fixed.
               | 
               | OK.. but "getting more bugs fixed" isn't any kind of
               | objective success metric for, well, anything, right?
               | 
               | It's fine if you want to use it as a KPI for your
               | specific thing! But it's not like it's some global KPI
               | for everyone?
        
               | tptacek wrote:
               | It's very specifically the objective of a security bug
               | bounty.
        
       | nottorp wrote:
       | Oh, there are competitions for finding vulnerabilities in
       | software?
       | 
       | That would explain why there's news every day that the world will
       | end because someone discovered something that "could" be used if
       | you already had local root...
       | 
        | Did that article presenting "people trusting external input too
        | much" as JSON parser vulnerabilities make it into this competition?
        
         | bryant wrote:
         | Further reading:
         | https://en.wikipedia.org/wiki/Bug_bounty_program
        
       | ryandrake wrote:
       | Receiving hundreds of AI generated bug reports would be so
       | demoralizing and probably turn me off from maintaining an open
       | source project forever. I think developers are going to
       | eventually need tools to filter out slop. If you didn't take the
       | time to write it, why should I take the time to read it?
        
         | triknomeister wrote:
         | Eventually projects who can afford the smugness are going to
         | charge people to be able to talk to open source developers.
        
           | tough wrote:
            | isn't that called enterprise support / consulting?
        
             | triknomeister wrote:
             | This is without the enterprise.
        
               | tough wrote:
                | gotchu, maybe i could see github donations enabling issue
                | creation or whatever in the future idk
                | 
                | but foss is foss, i guess source available doesn't mean we
                | have to read your messages, see sqlite (won't even take
                | PRs lol)
        
         | jgalt212 wrote:
         | One would think if AI can generate the slop it could also
         | triage the slop.
        
           | err4nt wrote:
           | How does it know the difference?
        
             | scubbo wrote:
             | I'm still on the AI-skeptic side of the spectrum (though
             | shifting more towards "it has some useful applications"),
             | but, I think the easy answer is - if different
             | models/prompts are used in generation than in
             | quality-/correctness-checking.
        
             | jgalt212 wrote:
             | I think Claude, given enough time to mull it over, could
             | probably come up with some sort of bug severity score.
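
A deterministic baseline for that already exists: the CVSS v3.1 base score can be computed mechanically from a report's claimed metrics, LLM or not. A minimal sketch in Python, using the unchanged-scope metric weights from the CVSS v3.1 specification (temporal/environmental metrics and scope changes are omitted for brevity):

```python
# CVSS v3.1 metric weights (scope unchanged), from the FIRST specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}  # Attack Vector
AC = {"L": 0.77, "H": 0.44}                        # Attack Complexity
PR = {"N": 0.85, "L": 0.62, "H": 0.27}             # Privileges Required
UI = {"N": 0.85, "R": 0.62}                        # User Interaction
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}             # Confid./Integrity/Avail.

def roundup(x: float) -> float:
    """Round up to one decimal, per the spec's float-safe recipe."""
    i = round(x * 100000)
    return i / 100000.0 if i % 10000 == 0 else (i // 10000 + 1) / 10.0

def base_score(av, ac, pr, ui, c, i, a):
    """CVSS v3.1 base score for an unchanged-scope vulnerability."""
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    return 0.0 if impact <= 0 else roundup(min(impact + exploitability, 10))
```

For example, a network-reachable unauthenticated RCE (AV:N/AC:L/PR:N/UI:N/C:H/I:H/A:H) scores 9.8, the familiar "critical" rating. An LLM triager could emit the vector components and leave the arithmetic to code like this.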
        
         | teeray wrote:
         | You see, the dream is another AI that reads the report and
         | writes the issue in the bug tracker. Then another AI implements
         | the fix. A third AI then reviews the code and approves and
         | merges it. All without human interaction! Once CI releases the
         | fix, the first AI can then find the same vulnerability plus a
         | few new and exciting ones.
        
           | dingnuts wrote:
           | This is completely absurd. If generating code is reliable,
           | you can have one generator make the change, and then merge
           | and release it with traditional software.
           | 
           | If it's not reliable, how can you rely on the written issue
           | to be correct, or the review, and so how does that benefit
           | you over just blindly merging whatever changes are created by
           | the model?
        
             | tempodox wrote:
             | Making sense is not required as long as "AI" vendors sell
             | subscriptions.
        
             | croes wrote:
             | That's why parent wrote it's a dream.
             | 
             | It's not real.
             | 
             | But you can bet someone will sell that as the solution.
        
         | Nicook wrote:
         | Open source maintainers have been complaining about this for a
         | while. https://sethmlarson.dev/slop-security-reports. I'm
          | assuming the proliferation of AI has already had, or will have,
          | some significant effects on open source projects.
        
         | tptacek wrote:
         | These aren't like Github Issues reports; they're bug bounty
         | programs, specifically stood up to soak up incoming reports
         | from anonymous strangers looking to make money on their
         | submissions, with the premise being that enough of those
         | reports will drive specific security goals (the scope of each
         | program is, for smart vendors, tailored to engineering goals
         | they have internally) to make it worthwhile.
        
           | ryandrake wrote:
           | Got it! The financial incentive will probably turn out to be
           | a double edged sword. Maybe in the pre-AI age, it's By Design
           | to drive those goals, but I bet the ability to automate
           | submissions will inevitably alter the rules of these
           | programs.
           | 
           | I think within the next 5 years or so, we are going to see a
           | societal pattern repeating: any program that rewards human
           | ingenuity and input will become industrialized by AI to the
           | point where it becomes a cottage industry of companies
           | flooding every program with 99% AI submissions. What used to
           | be lone wolves or small groups of humans working on bounties
           | will become truckloads of AI generated "stuff" trying to
           | maximize revenue.
        
             | dcminter wrote:
             | I'm wary of a lot of AI stuff, but here:
             | 
             | > _What used to be lone wolves or small groups of humans
             | working on bounties will become truckloads of AI generated
             | "stuff" trying to maximize revenue._
             | 
             | You're objecting to the wrong thing. The purpose of a bug
             | bounty programme is not to provide a cottage industry for
             | security artisans - it's to flush out security
             | vulnerabilities.
             | 
             | There are reasonable objections to AI automation in this
             | space, but this is not one of them.
        
             | t0mas88 wrote:
              | Might be fixable by adding a $100 submission fee that is
              | returned when you provide working exploit code. Would
              | make the curl team a lot of money.
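
The mechanics of such a refundable deposit are simple to state. A toy sketch (the $100 figure, class, and method names are all hypothetical; real escrow and payment handling are out of scope):

```python
from dataclasses import dataclass, field

FEE = 100  # hypothetical refundable submission deposit, in dollars

@dataclass
class BountyProgram:
    """Toy model of a refundable-deposit bug bounty queue (illustrative)."""
    escrow: dict = field(default_factory=dict)  # report_id -> deposit held
    treasury: int = 0                           # forfeited deposits

    def submit(self, report_id: str) -> None:
        self.escrow[report_id] = FEE            # deposit held at submission

    def triage(self, report_id: str, exploit_works: bool) -> int:
        deposit = self.escrow.pop(report_id)
        if exploit_works:
            return deposit                      # refunded to the reporter
        self.treasury += deposit                # forfeited: funds triage time
        return 0
```

A valid report gets its deposit back; slop forfeits it and funds the triage time it consumed.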
        
         | moyix wrote:
         | All of these reports came with executable proof of the
         | vulnerabilities - otherwise, as you say, you get flooded with
         | hallucinated junk like the poor curl dev. This is one of the
         | things that makes offensive security an actually good use case
         | for AI - exploits serve as hard evidence that the LLM can't
         | fake.
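
That gating can be automated: only accept a report whose PoC, run in isolation, demonstrably triggers the claimed effect. A simplified sketch (the function name and marker convention are hypothetical; a real pipeline would sandbox the PoC against a staging target):

```python
import subprocess
import sys

def verify_poc(argv, success_marker, timeout=30):
    """Run a submitted proof-of-concept and accept the report only if it
    demonstrably triggers the claimed effect (here: prints a marker the
    exploit is expected to exfiltrate, e.g. a planted canary value)."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False  # a PoC that hangs proves nothing
    return success_marker in result.stdout

# Example: a trivial stand-in "exploit" that prints a planted canary.
poc = [sys.executable, "-c", "print('CANARY-1337')"]
assert verify_poc(poc, "CANARY-1337")
```

A PoC that fails to reproduce, or hangs, is rejected before any human sees the report.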
        
         | bawolff wrote:
         | If you think the AI slop is demoralizing, you should see the
         | human submissions bug bounties get.
         | 
         | There is a reason companies like hackerone exist - its because
         | dealing with the submissions is terrible.
        
       | mkagenius wrote:
       | > XBOW submitted nearly 1,060 vulnerabilities.
       | 
       | Yikes, explains why my manually submitted single vulnerability is
       | taking weeks to triage.
        
         | tptacek wrote:
         | The XBOW people are not randos.
        
           | lcnPylGDnU4H9OF wrote:
           | That's not their point, I think. They're just saying that
           | those nearly 1060 vulnerabilities are being processed so
           | theirs is being ignored (hence "triage").
        
             | tptacek wrote:
             | If that's all they're saying then there isn't much to do
             | with the sentiment; if you're legit-finding #1061 after
             | legit-findings #1-#1060, that's just life in the NFL. I
             | took instead the meaning that the findings ahead of them
             | were less than legit.
        
               | croes wrote:
               | Whether it is legit-finding is precisely what needs to be
               | checked, but you're at spot 1061.
               | 
               | >130 resolved
               | 
               | >303 were classified as Triaged
               | 
               | >33 reports marked as new
               | 
               | >125 remain pending
               | 
               | >208 were marked as duplicates
               | 
               | >209 as informative
               | 
               | >36 not applicable
               | 
                | Even 20% of these tie up a lot of resources if you have a
                | high volume of submissions, and the numbers will rise.
        
               | tptacek wrote:
               | I think some context I probably don't share with the rest
                | of this thread is that the average quality of a HackerOne
                | submission is _incredibly_ low. Like however bad you
                | think the median bounty submission is, it's worse; think
               | "people threatening to take you to court for not paying
               | them for their report that they can 'XSS' you with the
               | Chrome developer console".
        
               | croes wrote:
                | We'll get these low-quality submissions with AI too.
                | 
                | The problem is that the people who know how to use AI
                | properly will be slower and more careful in their
                | submissions.
               | 
               | Many others won't, so we'll get lots of noise hiding the
               | real issues. AI makes it easy to produce many bad results
               | in short time.
        
               | tptacek wrote:
               | Everyone already agrees with that; the interesting
               | argument here is that it also makes it easy to produce
               | many good results in short time.
        
               | croes wrote:
               | But the good ones don't have the same output rate because
               | they are checked by humans before submission.
               | 
                | They are faster than the purely manual ones but can't
                | beat the AI-created bad ones in either speed or
                | numbers.
               | 
               | It's like the IT security version of the Gish gallop.
        
               | peanut-walrus wrote:
               | My favorite one I've seen is "open redirect when you
               | change the domain name in the browser address bar". This
               | was submitted twice several years apart by two different
               | people.
        
               | aspenmayer wrote:
               | I can't speak to the average quality of submissions, as
               | I've only made one to HackerOne myself iirc. I don't even
               | consider myself good at coding or aware of how to file a
                | bug report or bounty submission. I reported that in the
                | iOS Coinbase app, if you were on a VPN, the Coinbase app
                | PIN simply didn't exist anymore, and did not appear in
               | the settings as enabled either. I included a full video
               | of this occurring and it seemed reproducible. The
               | Coinbase person said that this was not an issue because
               | you would already need access to the physical device and
               | know the iOS passcode; relevant to this is that at the
               | time and maybe now, the Coinbase iOS app didn't hook the
               | iOS passcode for access control, like Signal or other
               | apps do, but instead has its own app passcode. The fact
               | that this was circumventable by adding and connecting to
               | any VPN on the same iOS device seemed like a bug in the
               | implementation, even if it is the code working as
                | written. The issue was closed and I lost 5 of what I think
                | are called HackerRank points. It felt very hostile to my
               | efforts that I lost points, since I don't think that was
               | justified. Perhaps that is just how the platform works
               | for denied bug reports on HackerOne, but I have no way of
               | knowing that, as the Coinbase report is the only time I
               | used the platform.
        
               | lcnPylGDnU4H9OF wrote:
               | > there isn't much to do with the sentiment
               | 
               | I see what you're saying but I think a more charitable
               | interpretation can be made. They may be amazed that so
               | many bug reports are being generated by such a reputable
               | group. Looking at your initial reply, perhaps a more
               | constructive comment could be one that joins them in
               | excitement (even if that assumption is erroneous) and
               | expanding on why you think it is exciting (e.g. this
               | group's reputation for quality).
        
               | stronglikedan wrote:
               | > I took instead the meaning that the findings ahead of
               | them were less than legit.
               | 
               | I took instead the opposite - that they were no longer
               | shocked that it was taking so long once they found out
               | why, as they knew who they were and understood.
        
       | jekwoooooe wrote:
       | They should ban this or else they will get swallowed up and
       | companies will stop working with them. The last thing I want is a
       | bunch of llm slop sent to me faster than a human would
        
         | fredfish wrote:
         | As long as they maintain a history per account and discourage
          | gaming with new accounts, I don't see why anyone would prefer
          | lower-performing slop just because it was produced manually. (I
         | just had someone tell me that they wished the nonsensical
         | bounty submissions they triaged were at least being fixed up
         | with gpt3.)
        
         | danmcs wrote:
         | HackerOne was already useless years before LLMs. Vulnerability
         | scanning was already automated.
         | 
         | When we put our product on there, roughly 2019, the
         | enterprising hackers ran their scanners, submitted everything
         | they found as the highest possible severity to attempt to
         | maximize their payout, and moved on. We wasted time triaging
         | all the stuff they submitted that was nonsense, got nothing
         | valuable out of the engagement, and dropped HackerOne at the
         | end of the contract.
         | 
         | You'd be much better off contracting a competent engineering
         | security firm to inspect your codebase and infrastructure.
        
           | tptacek wrote:
           | Moreover, I don't think XBOW is likely generating the kind of
           | slop beg bounty people generate. There's some serious work
           | behind this.
        
             | tecleandor wrote:
              | Still, they're sending hundreds of reports that are being
              | refused because they don't follow the rules of the
              | bounties. So they'd better work on that.
        
               | tptacek wrote:
               | If you thought human bounty program participants were
               | generally following the rules, or that programs weren't
               | swamped with slop already... at least these are actually
               | pre-triaged vetted findings.
        
               | tecleandor wrote:
                | But I was hoping the idea wasn't "since there are a lot of
                | sloppy posts, we're going to be sloppy too and flood
                | them". So, use the AI for something useful and at least
                | grep the rules properly. That'd be neat.
        
             | radialstub wrote:
              | Do you have sources, in case we want to learn more?
        
               | moyix wrote:
               | We've got a bunch of agent traces on the front page of
               | the web site right now. We also have done writeups on
               | individual vulnerabilities found by the system, mostly in
               | open source right now (we did some fun scans of OSS
               | projects found on Docker Hub). We have a bunch more
               | coming up about the vulns found in bug bounty targets.
               | The latter are bottlenecked by getting approval from the
               | companies affected, unfortunately.
               | 
               | Some of my favorites from what we've released so far:
               | 
               | - Exploitation of an n-day RCE in Jenkins, where the
               | agent managed to figure out the challenge environment was
               | broken and used the RCE exploit to debug the server
               | environment and work around the problem to solve the
               | challenge: https://xbow.com/#debugging--testing--and-
               | refining-a-jenkins...
               | 
               | - Authentication bypass in Scoold that allowed reading
               | the server config (including API keys) and arbitrary file
               | read: https://xbow.com/blog/xbow-scoold-vuln/
               | 
               | - The first post about our HackerOne findings, an XSS in
               | Palo Alto Networks GlobalProtect VPN portal used by a
               | bunch of companies: https://xbow.com/blog/xbow-
               | globalprotect-xss/
        
           | strken wrote:
           | We still get reports for such major issues as "this unused
           | domain held my connection for ten seconds and then timed out,
           | which broke the badly-written SQL injection scanner I found
           | on GitHub and ran without understanding".
        
       | tecleandor wrote:
       | First:
       | 
       | > To bridge that gap, we started dogfooding XBOW in public and
       | private bug bounty programs hosted on HackerOne. We treated it
       | like any external researcher would: no shortcuts, no internal
       | knowledge--just XBOW, running on its own.
       | 
        | Is it dogfooding if you're not doing it to yourself? I'd
        | consider it dogfooding only if they were flooding themselves with
        | AI-generated bug reports, not other people. They're not the ones
        | reviewing them.
        | 
        | Also, honest question: what does "best" mean here? The one that
        | has sent the most reports?
        
         | jamessinghal wrote:
          | Their success rates on HackerOne seem to vary widely:
          | 
          | 22/24 (Valid / Closed) for Walt Disney
          | 3/43 (Valid / Closed) for AT&T
        
           | thaumasiotes wrote:
            | > Their success rates on HackerOne seem to vary widely.
           | 
           | Some of that is likely down to company policies; Snapchat's
           | policy, for example, is that nothing is ever marked invalid.
        
             | jamessinghal wrote:
             | Yes, I'm sure anyone with more HackerOne experience can
             | give specifics on the companies' policies. For now, those
             | are the most objective measures of quality we have on the
             | reports.
        
               | moyix wrote:
               | This is discussed in the post - many came down to
               | individual programs' policies e.g. not accepting the
               | vulnerability if it was in a 3rd party product they used
               | (but still hosted by them), duplicates (another
               | researcher reported the same vuln at the same time; not
               | really any way to avoid this), or not accepting some
               | classes of vuln like cache poisoning.
        
           | pclmulqdq wrote:
           | Walt Disney doesn't pay bug bounties. AT&T's bounties go up
           | to $5k, which is decent but still not much. It's possible
           | that the market for bugs is efficient.
        
             | monster_truck wrote:
             | Walt Disney's program covers substantially more surface
              | area; there are six(?) publicly traded companies listed there. In
             | addition to covering far fewer domains & apps, AT&T's
             | conditions and exclusions disqualify a lot more.
             | 
             | The market for bounties is a circus, breadcrumbs for free
             | work from people trying to 'make it'. It can safely be
             | analogized to the classic trope of those wanting to work in
             | games getting paid fractional market rates for absurd
             | amounts of QA effort. The number of CVSS vulns with a score
             | above 8 that have floated across the front page of HN in
             | the past year without anyone getting paid tells you that
             | much.
        
       | bgwalter wrote:
       | "XBOW is an enterprise solution. If your company would like a
       | demo, email us at info@xbow.com."
       | 
       | Like any "AI" article, this is an ad.
       | 
       | If you are willing to tolerate a high false positive rate, you
       | can as well use Rational Purify or various analyzers.
        
         | moyix wrote:
         | You should come to my upcoming BlackHat talk on how we did this
         | while avoiding false positives :D
         | 
         | https://www.blackhat.com/us-25/briefings/schedule/#ai-agents...
        
           | tptacek wrote:
           | You should publish the paper quietly here (I'm a Black Hat
           | reviewer, FWIW) so people can see where you're coming from.
           | 
            | I know you've been on HN for a while, and that you're doing
           | interesting stuff; HN just has a really intense immune system
           | against vendor-y stuff.
        
             | moyix wrote:
             | Yeah, it's been very strange being on the other side of
             | that after 10 years in academia! But it's totally
             | reasonable for people to be skeptical when there's a bunch
             | of money sloshing around.
             | 
             | I'll see if I can get time to do a paper to accompany the
             | BH talk. And hopefully the agent traces of individual vulns
             | will also help.
        
               | tptacek wrote:
               | J'accuse! You were required to do a paper for BH anyways!
               | :)
        
               | moyix wrote:
               | Wait a sec, I thought they were optional?
               | 
               | > White Paper/Slide Deck/Supporting Materials (optional)
               | 
               | > * If you have a completed white paper or draft, slide
               | deck, or other supporting materials, you can optionally
               | provide a link for review by the board.
               | 
               | > * Please note: Submission must be self-contained for
               | evaluation, supporting materials are optional.
               | 
               | > * PDF or online viewable links are preferred, where no
               | authentication/log-in is required.
               | 
               | (From the link on the BHUSA CFP page, which confusingly
               | goes to the BH Asia doc:
               | https://i.blackhat.com/Asia-25/BlackHat-Asia-2025-CFP-
               | Prepar... )
        
               | tptacek wrote:
               | I think you're fine, most people don't take the paper bit
               | seriously. It's not due until the end of July regardless
               | (you don't need a paper to submit for the CFP).
        
               | daeken wrote:
               | The scramble to get your paper done in time is
               | traditional! (And why my final paper for the onity lock
               | hack ended up with an entire section I decided was better
               | off left unsaid; woops)
        
       | mellosouls wrote:
        | Have XBow provided a link to this claim? I could only find:
       | 
       | https://hackerone.com/xbow?type=user
       | 
       | Which shows a different picture. This may not invalidate their
       | claim (best US), but a screenshot can be a bit cherry-picked.
        
       | chc4 wrote:
       | I'm generally pretty bearish on AI security research, and think
       | most people don't know anything about what they're talking about,
       | but XBOW is frankly one of the few legitimately interesting and
       | competent companies in the space, and their writeups and reports
       | have good and well thought out results. Congrats!
        
       | wslh wrote:
       | I'm looking forward to the LLM's ELI5 explanation. If I
       | understand correctly, XBOW is genuinely moving the needle and
       | pushing the state of the art.
       | 
       | Another great reading is [1](2024).
       | 
       | [1] "LLM and Bug Finding: Insights from a $2M Winning Team in the
       | White House's AIxCC":
       | https://news.ycombinator.com/item?id=41269791
        
       | hinterlands wrote:
        | Xbow has really smart people working on it, so they're well aware
       | of the usual 30-second critiques that come up in this thread. For
       | example, they take specific steps to eliminate false positives.
       | 
       | The #1 spot in the ranking is both more of a deal and less of a
       | deal than it might appear. It's less of a deal in that HackerOne
       | is an economic numbers game. There are countless programs you can
       | sign up for, with varied difficulty levels and payouts. Most of
       | them pay not a whole lot and don't attract top talent in the
       | industry. Instead, they offer supplemental income to infosec-
       | minded school-age kids in the developing world. So I wouldn't
       | read this as "Xbow is the best bug hunter in the US". That's a
       | bit of a marketing gimmick.
       | 
       | But this is also not a particularly meaningful objective. The
       | problem is that there's a lot of low-hanging bugs that need
       | squashing and it's hard to allocate sufficient resources to that.
       | Top infosec talent doesn't want to do it (and there's not enough
       | of it). Consulting companies can do it, but they inevitably end
       | up stretching themselves too thin, so the coverage ends up being
       | hit-and-miss. There's a huge market for tools that can find easy
       | bugs cheaply and without too many false positives.
       | 
       | I personally don't doubt that LLMs and related techniques are
       | well-tailored for this task, completely independent of whether
       | they can outperform leading experts. But there are skeptics, so I
       | think this is an important real-world result.
        
         | absurdo wrote:
         | > so they're well-aware of the usual 30-second critiques that
         | come up in this thread.
         | 
         | Succinct description of HN. It's a damn shame.
        
         | normie3000 wrote:
         | > Top infosec talent doesn't want to do it (and there's not
         | enough of it).
         | 
         | What is the top talent spending its time on?
        
           | hinterlands wrote:
           | Vulnerability researchers? For public projects, there's a
           | strong preference for prestige stuff: ecosystem-wide
           | vulnerabilities, new attack techniques, attacking cool new
           | tech (e.g., self-driving cars).
           | 
           | To pay bills: often working for tier A tech companies on
           | intellectually-stimulating projects, such as novel
           | mitigations, proprietary automation, etc. Or doing lucrative
           | consulting / freelance work. Generally not triaging Nessus
           | results 9-to-5.
        
           | tptacek wrote:
           | Specialized bug-hunting.
        
       | martinald wrote:
       | This does not surprise me. In a couple of 'legacy' open source
       | projects I found DoS attacks within 10 minutes, with a working
       | PoC. It crashed the server entirely. I suspect with more
       | prompting it could have found RCE but it was an idle shower
       | thought to try.
       | 
        | While niche and not widely used, there are at least thousands of
       | publicly available servers for each of these projects.
       | 
       | I genuinely think this is one of the biggest near term issues
       | with AI. Even if we get great AI "defence" tooling, there are
       | just so many servers and (IoT or otherwise) devices out there,
        | most of which are not trivial to patch. While a few niche services
        | getting pwned probably isn't a big deal, a million niche services
       | all getting pwned in quick succession is likely to cause huge
       | disruption. There is so much code out there that hasn't been
       | remotely security checked.
       | 
       | Maybe the end solution is some sort of LLM based "WAF" that
       | inspects all traffic that ISPs deploy.
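
As a sketch of what such a gate might look like, here is a toy request filter in which a rule list stands in for the LLM's judgment (the patterns and function are illustrative only; a real deployment would send the request to a model and act on its verdict, and would have to contend with cost, latency, and false positives):

```python
import re

# Crude patterns standing in for an LLM's judgment; a real deployment would
# send the request (or a summary of it) to a model and parse its verdict.
SUSPICIOUS = [
    re.compile(r"(?i)\bunion\s+select\b"),  # SQL injection
    re.compile(r"(?i)<script\b"),           # reflected XSS payloads
    re.compile(r"\.\./\.\./"),              # path traversal
]

def allow_request(method: str, path: str, body: str) -> bool:
    """Toy 'inspect-everything' WAF gate: block a request if any payload
    fragment matches a known-bad pattern. An LLM-backed version would
    replace the pattern list with a model call and a confidence threshold."""
    blob = " ".join([method, path, body])
    return not any(p.search(blob) for p in SUSPICIOUS)
```

Even this trivial version shows the trade-off: the gate sees every request, so anything slow or expensive in the classifier multiplies across all traffic.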
        
       ___________________________________________________________________
       (page generated 2025-06-24 23:00 UTC)