[HN Gopher] XBOW, an autonomous penetration tester, has reached ...
___________________________________________________________________
XBOW, an autonomous penetration tester, has reached the top spot on
HackerOne
Author : summarity
Score : 122 points
Date : 2025-06-24 15:53 UTC (7 hours ago)
(HTM) web link (xbow.com)
(TXT) w3m dump (xbow.com)
| ikmckenz wrote:
| Related: https://arstechnica.com/gadgets/2025/05/open-source-
| project-...
| moyix wrote:
| The main difference is that all of the vulnerabilities reported
| here are real, and many are quite critical (XXE, RCE, SQLi, etc.).
| To be fair, there were definitely a lot of XSS reports, but the
| main reason for that is that XSS is a really common vulnerability.
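| For reference, a minimal sketch of why XXE is dangerous (a
| generic illustration using the classic file-read payload, not one
| of the actual reports; assumes Python with lxml installed):
|
|     from lxml import etree
|
|     # Classic XXE probe: an external entity pointing at a local file.
|     payload = b"""<?xml version="1.0"?>
|     <!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>
|     <foo>&xxe;</foo>"""
|
|     # Vulnerable configuration: entity resolution enabled.
|     vuln = etree.XMLParser(resolve_entities=True)
|     print(etree.fromstring(payload, vuln).text)   # leaks the file
|
|     # Hardened configuration: the entity is simply not expanded.
|     safe = etree.XMLParser(resolve_entities=False)
|     print(etree.fromstring(payload, safe).text)   # no file contents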
| andrewstuart wrote:
| All the fun vanishes.
| tptacek wrote:
| Good. It was in the way.
| kiitos wrote:
| In the way of what?
| tptacek wrote:
| Getting more bugs fixed.
| kiitos wrote:
| > Getting more bugs fixed.
|
| OK... but "getting more bugs fixed" isn't any kind of
| objective success metric for, well, anything, right?
|
| It's fine if you want to use it as a KPI for your
| specific thing! But it's not like it's some global KPI
| for everyone?
| tptacek wrote:
| It's very specifically the objective of a security bug
| bounty.
| nottorp wrote:
| Oh, there are competitions for finding vulnerabilities in
| software?
|
| That would explain why there's news every day that the world will
| end because someone discovered something that "could" be used if
| you already had local root...
|
| Did that article that presented "people trusting external input
| too much" as JSON parser vulnerabilities make it to this
| competition?
| bryant wrote:
| Further reading:
| https://en.wikipedia.org/wiki/Bug_bounty_program
| ryandrake wrote:
| Receiving hundreds of AI generated bug reports would be so
| demoralizing and probably turn me off from maintaining an open
| source project forever. I think developers are going to
| eventually need tools to filter out slop. If you didn't take the
| time to write it, why should I take the time to read it?
| triknomeister wrote:
| Eventually, projects that can afford the smugness are going to
| charge people for the ability to talk to open source developers.
| tough wrote:
| isn't that called enterprise support / consulting?
| triknomeister wrote:
| This is without the enterprise.
| tough wrote:
| gotchu, maybe i could see github donations enabling issue
| creation or whatever in the future, idk
|
| but foss is foss, i guess source available doesn't mean we
| have to read your messages; see sqlite (won't even take
| PRs lol)
| jgalt212 wrote:
| One would think if AI can generate the slop it could also
| triage the slop.
| err4nt wrote:
| How does it know the difference?
| scubbo wrote:
| I'm still on the AI-skeptic side of the spectrum (though
| shifting more towards "it has some useful applications"),
| but, I think the easy answer is - if different
| models/prompts are used in generation than in
| quality-/correctness-checking.
| jgalt212 wrote:
| I think Claude, given enough time to mull it over, could
| probably come up with some sort of bug severity score.
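| A minimal sketch of that split (call_llm is a hypothetical
| stand-in for whatever completion API is in use; none of this
| is XBOW's actual pipeline):
|
|     import re
|
|     def call_llm(model: str, prompt: str) -> str:
|         """Hypothetical stand-in for a model API call."""
|         raise NotImplementedError
|
|     def triage_report(report: str) -> dict:
|         # Key idea: the checking model/prompt differs from the
|         # generating one, so the verifier isn't grading its own
|         # homework.
|         verdict = call_llm(
|             model="verifier-model",  # distinct from the generator
|             prompt=(
|                 "You are triaging a security bug report. Reply:\n"
|                 "VERDICT: VALID or INVALID\n"
|                 "SEVERITY: a number from 0 to 10\n\n" + report
|             ),
|         )
|         valid = bool(re.search(r"VERDICT:\s*VALID", verdict, re.I))
|         m = re.search(r"SEVERITY:\s*([0-9.]+)", verdict)
|         return {"valid": valid,
|                 "severity": float(m.group(1)) if m else 0.0}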
| teeray wrote:
| You see, the dream is another AI that reads the report and
| writes the issue in the bug tracker. Then another AI implements
| the fix. A third AI then reviews the code and approves and
| merges it. All without human interaction! Once CI releases the
| fix, the first AI can then find the same vulnerability plus a
| few new and exciting ones.
| dingnuts wrote:
| This is completely absurd. If generating code is reliable,
| you can have one generator make the change, and then merge
| and release it with traditional tooling.
|
| If it's not reliable, how can you rely on the written issue
| to be correct, or the review, and so how does that benefit
| you over just blindly merging whatever changes are created by
| the model?
| tempodox wrote:
| Making sense is not required as long as "AI" vendors sell
| subscriptions.
| croes wrote:
| That's why parent wrote it's a dream.
|
| It's not real.
|
| But you can bet someone will sell that as the solution.
| Nicook wrote:
| Open source maintainers have been complaining about this for a
| while: https://sethmlarson.dev/slop-security-reports. I'm
| assuming the proliferation of AI will have (or already has had)
| some significant effects on open source projects.
| tptacek wrote:
| These aren't like Github Issues reports; they're bug bounty
| programs, specifically stood up to soak up incoming reports
| from anonymous strangers looking to make money on their
| submissions, with the premise being that enough of those
| reports will drive specific security goals (the scope of each
| program is, for smart vendors, tailored to engineering goals
| they have internally) to make it worthwhile.
| ryandrake wrote:
| Got it! The financial incentive will probably turn out to be
| a double-edged sword. Maybe in the pre-AI age it was By Design
| to drive those goals, but I bet the ability to automate
| submissions will inevitably alter the rules of these
| programs.
|
| I think within the next 5 years or so, we are going to see a
| societal pattern repeating: any program that rewards human
| ingenuity and input will become industrialized by AI to the
| point where it becomes a cottage industry of companies
| flooding every program with 99% AI submissions. What used to
| be lone wolves or small groups of humans working on bounties
| will become truckloads of AI generated "stuff" trying to
| maximize revenue.
| dcminter wrote:
| I'm wary of a lot of AI stuff, but here:
|
| > _What used to be lone wolves or small groups of humans
| working on bounties will become truckloads of AI generated
| "stuff" trying to maximize revenue._
|
| You're objecting to the wrong thing. The purpose of a bug
| bounty programme is not to provide a cottage industry for
| security artisans - it's to flush out security
| vulnerabilities.
|
| There are reasonable objections to AI automation in this
| space, but this is not one of them.
| t0mas88 wrote:
| Might be fixable by adding a $100 submission fee that is
| returned when you provide working exploit code. Would
| make the curl team a lot of money.
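| Mechanically it's just a refundable deposit (a toy sketch;
| the amounts and names are made up):
|
|     from dataclasses import dataclass, field
|
|     @dataclass
|     class BountyProgram:
|         deposit_usd: int = 100
|         escrow: dict = field(default_factory=dict)
|         kept_fees: int = 0
|
|         def submit(self, submission_id: str) -> None:
|             # Fee sits in escrow until the report is adjudicated.
|             self.escrow[submission_id] = self.deposit_usd
|
|         def adjudicate(self, submission_id: str,
|                        exploit_works: bool) -> int:
|             fee = self.escrow.pop(submission_id)
|             if exploit_works:
|                 return fee         # refunded alongside any bounty
|             self.kept_fees += fee  # slop subsidizes triage instead
|             return 0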
| moyix wrote:
| All of these reports came with executable proof of the
| vulnerabilities - otherwise, as you say, you get flooded with
| hallucinated junk like the poor curl dev. This is one of the
| things that makes offensive security an actually good use case
| for AI - exploits serve as hard evidence that the LLM can't
| fake.
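| A sketch of what executable proof buys you at triage time
| (a hypothetical harness, not XBOW's; assumes the PoC is a
| script that prints a known canary marker only when the
| exploit actually fires):
|
|     import subprocess
|
|     def poc_validates(poc_path: str, marker: str,
|                       timeout_s: int = 60) -> bool:
|         """Run a PoC against a sandboxed target and accept the
|         report only if the exploit demonstrably fired."""
|         try:
|             result = subprocess.run(
|                 ["python", poc_path],
|                 capture_output=True, text=True, timeout=timeout_s,
|             )
|         except subprocess.TimeoutExpired:
|             return False
|         # Hard evidence: the marker (e.g., a canary file's
|         # contents) can't come from a hallucinated writeup.
|         return marker in result.stdout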
| bawolff wrote:
| If you think the AI slop is demoralizing, you should see the
| human submissions bug bounties get.
|
| There is a reason companies like HackerOne exist - it's because
| dealing with the submissions is terrible.
| mkagenius wrote:
| > XBOW submitted nearly 1,060 vulnerabilities.
|
| Yikes, explains why my manually submitted single vulnerability is
| taking weeks to triage.
| tptacek wrote:
| The XBOW people are not randos.
| lcnPylGDnU4H9OF wrote:
| That's not their point, I think. They're just saying that
| those nearly 1060 vulnerabilities are being processed so
| theirs is being ignored (hence "triage").
| tptacek wrote:
| If that's all they're saying then there isn't much to do
| with the sentiment; if you're legit-finding #1061 after
| legit-findings #1-#1060, that's just life in the NFL. I
| took instead the meaning that the findings ahead of them
| were less than legit.
| croes wrote:
| Whether it is a legit finding is precisely what needs to be
| checked, but you're at spot 1061.
|
| >130 resolved
|
| >303 were classified as Triaged
|
| >33 reports marked as new
|
| >125 remain pending
|
| >208 were marked as duplicates
|
| >209 as informative
|
| >36 not applicable
|
| Even 20% binds a lot of resources if you have a high volume
| of submissions, and the numbers will rise.
| tptacek wrote:
| I think some context I probably don't share with the rest
| of this thread is that the average quality of a Hacker
| One submission is _incredibly_ low. Like however bad you
| think the median bounty submission is, it's worse; think
| "people threatening to take you to court for not paying
| them for their report that they can 'XSS' you with the
| Chrome developer console".
| croes wrote:
| We'll get these low-quality submissions with AI too.
|
| The problem is that the people who know how to use AI
| properly will be slower and more careful in their
| submissions.
|
| Many others won't, so we'll get lots of noise hiding the
| real issues. AI makes it easy to produce many bad results
| in a short time.
| tptacek wrote:
| Everyone already agrees with that; the interesting
| argument here is that it also makes it easy to produce
| many good results in short time.
| croes wrote:
| But the good ones don't have the same output rate, because
| they are checked by humans before submission.
|
| They are faster than the purely manual ones but can't
| beat the AI-created bad ones in either speed or numbers.
|
| It's like the IT security version of the Gish gallop.
| peanut-walrus wrote:
| My favorite one I've seen is "open redirect when you
| change the domain name in the browser address bar". This
| was submitted twice several years apart by two different
| people.
| aspenmayer wrote:
| I can't speak to the average quality of submissions, as
| I've only made one to HackerOne myself, iirc. I don't even
| consider myself good at coding, or aware of how to file a
| bug report or bounty submission. I reported that in the iOS
| Coinbase app, if you were on a VPN, the Coinbase app PIN
| simply didn't exist anymore, and didn't appear as enabled
| in the settings either. I included a full video of this
| occurring, and it seemed reproducible. The Coinbase person
| said this was not an issue because you would already need
| access to the physical device and know the iOS passcode.
| Relevant to this: at the time (and maybe now), the Coinbase
| iOS app didn't hook into the iOS passcode for access
| control, like Signal and other apps do, but instead had its
| own app passcode. The fact that this was circumventable by
| adding and connecting to any VPN on the same iOS device
| seemed like a bug in the implementation, even if it was the
| code working as written. The issue was closed and I lost 5
| of what I think are called reputation points. It felt very
| hostile to my efforts to lose points, since I don't think
| that was justified. Perhaps that is just how the platform
| works for denied bug reports on HackerOne, but I have no
| way of knowing, as the Coinbase report is the only time I
| used the platform.
| lcnPylGDnU4H9OF wrote:
| > there isn't much to do with the sentiment
|
| I see what you're saying but I think a more charitable
| interpretation can be made. They may be amazed that so
| many bug reports are being generated by such a reputable
| group. Looking at your initial reply, perhaps a more
| constructive comment would be one that joins them in
| excitement (even if that assumption is erroneous) and
| expands on why you think it is exciting (e.g. this
| group's reputation for quality).
| stronglikedan wrote:
| > I took instead the meaning that the findings ahead of
| them were less than legit.
|
| I took the opposite meaning - that they were no longer
| surprised it was taking so long once they found out why,
| since they knew who the XBOW people were and understood.
| jekwoooooe wrote:
| They should ban this, or else they will get swallowed up and
| companies will stop working with them. The last thing I want is
| a bunch of LLM slop sent to me faster than a human could send it.
| fredfish wrote:
| As long as they maintain a history per account and discourage
| gaming with new accounts, I don't see why anyone would prefer
| lower-performing slop just because that slop was manual. (I
| just had someone tell me that they wished the nonsensical
| bounty submissions they triaged were at least being fixed up
| with GPT-3.)
| danmcs wrote:
| HackerOne was already useless years before LLMs. Vulnerability
| scanning was already automated.
|
| When we put our product on there, roughly 2019, the
| enterprising hackers ran their scanners, submitted everything
| they found as the highest possible severity to attempt to
| maximize their payout, and moved on. We wasted time triaging
| all the stuff they submitted that was nonsense, got nothing
| valuable out of the engagement, and dropped HackerOne at the
| end of the contract.
|
| You'd be much better off contracting a competent engineering
| security firm to inspect your codebase and infrastructure.
| tptacek wrote:
| Moreover, I don't think XBOW is likely generating the kind of
| slop beg bounty people generate. There's some serious work
| behind this.
| tecleandor wrote:
| Still, they're sending hundreds of reports that are being
| refused because they don't follow the rules of the bounty
| programs. So they'd better work on that.
| tptacek wrote:
| If you thought human bounty program participants were
| generally following the rules, or that programs weren't
| swamped with slop already... at least these are actually
| pre-triaged vetted findings.
| tecleandor wrote:
| But I was hoping the idea wasn't "since there are a lot of
| sloppy posts, we're going to be sloppy too and flood
| them". So use the AI for something useful and at least
| grep the rules properly. That'd be neat.
| radialstub wrote:
| Do you have sources where we can learn more?
| moyix wrote:
| We've got a bunch of agent traces on the front page of
| the website right now. We've also done writeups on
| individual vulnerabilities found by the system, mostly in
| open source so far (we did some fun scans of OSS
| projects found on Docker Hub). We have a bunch more
| coming up about the vulns found in bug bounty targets;
| those are bottlenecked by getting approval from the
| affected companies, unfortunately.
|
| Some of my favorites from what we've released so far:
|
| - Exploitation of an n-day RCE in Jenkins, where the
| agent managed to figure out the challenge environment was
| broken and used the RCE exploit to debug the server
| environment and work around the problem to solve the
| challenge: https://xbow.com/#debugging--testing--and-
| refining-a-jenkins...
|
| - Authentication bypass in Scoold that allowed reading
| the server config (including API keys) and arbitrary file
| read: https://xbow.com/blog/xbow-scoold-vuln/
|
| - The first post about our HackerOne findings, an XSS in
| Palo Alto Networks GlobalProtect VPN portal used by a
| bunch of companies: https://xbow.com/blog/xbow-
| globalprotect-xss/
| strken wrote:
| We still get reports for such major issues as "this unused
| domain held my connection for ten seconds and then timed out,
| which broke the badly-written SQL injection scanner I found
| on GitHub and ran without understanding".
| tecleandor wrote:
| First:
|
| > To bridge that gap, we started dogfooding XBOW in public and
| private bug bounty programs hosted on HackerOne. We treated it
| like any external researcher would: no shortcuts, no internal
| knowledge--just XBOW, running on its own.
|
| Is it dogfooding if you're not doing it to yourself? I'd
| consider it dogfooding only if they were flooding themselves
| with AI-generated bug reports, not other people. They're not
| the ones reviewing them.
|
| Also, honest question: what does "best" mean here? The one
| that has sent the most reports?
| jamessinghal wrote:
| Their success rates on HackerOne vary widely:
|
| 22/24 (Valid / Closed) for Walt Disney
|
| 3/43 (Valid / Closed) for AT&T
| thaumasiotes wrote:
| > Their success rates on HackerOne vary widely.
|
| Some of that is likely down to company policies; Snapchat's
| policy, for example, is that nothing is ever marked invalid.
| jamessinghal wrote:
| Yes, I'm sure anyone with more HackerOne experience can
| give specifics on the companies' policies. For now, those
| are the most objective measures of quality we have on the
| reports.
| moyix wrote:
| This is discussed in the post - many came down to
| individual programs' policies e.g. not accepting the
| vulnerability if it was in a 3rd party product they used
| (but still hosted by them), duplicates (another
| researcher reported the same vuln at the same time; not
| really any way to avoid this), or not accepting some
| classes of vuln like cache poisoning.
| pclmulqdq wrote:
| Walt Disney doesn't pay bug bounties. AT&T's bounties go up
| to $5k, which is decent but still not much. It's possible
| that the market for bugs is efficient.
| monster_truck wrote:
| Walt Disney's program covers substantially more surface
| area; there are six(?) publicly traded companies listed
| there. In addition to covering far fewer domains and apps,
| AT&T's conditions and exclusions disqualify a lot more.
|
| The market for bounties is a circus, breadcrumbs for free
| work from people trying to 'make it'. It can safely be
| analogized to the classic trope of those wanting to work in
| games getting paid fractional market rates for absurd
| amounts of QA effort. The number of CVSS vulns with a score
| above 8 that have floated across the front page of HN in
| the past year without anyone getting paid tells you that
| much.
| bgwalter wrote:
| "XBOW is an enterprise solution. If your company would like a
| demo, email us at info@xbow.com."
|
| Like any "AI" article, this is an ad.
|
| If you are willing to tolerate a high false-positive rate, you
| might as well use Rational Purify or various analyzers.
| moyix wrote:
| You should come to my upcoming BlackHat talk on how we did this
| while avoiding false positives :D
|
| https://www.blackhat.com/us-25/briefings/schedule/#ai-agents...
| tptacek wrote:
| You should publish the paper quietly here (I'm a Black Hat
| reviewer, FWIW) so people can see where you're coming from.
|
| I know you've been on HN for a while, and that you're doing
| interesting stuff; HN just has a really intense immune system
| against vendor-y stuff.
| moyix wrote:
| Yeah, it's been very strange being on the other side of
| that after 10 years in academia! But it's totally
| reasonable for people to be skeptical when there's a bunch
| of money sloshing around.
|
| I'll see if I can get time to do a paper to accompany the
| BH talk. And hopefully the agent traces of individual vulns
| will also help.
| tptacek wrote:
| J'accuse! You were required to do a paper for BH anyways!
| :)
| moyix wrote:
| Wait a sec, I thought they were optional?
|
| > White Paper/Slide Deck/Supporting Materials (optional)
|
| > * If you have a completed white paper or draft, slide
| deck, or other supporting materials, you can optionally
| provide a link for review by the board.
|
| > * Please note: Submission must be self-contained for
| evaluation, supporting materials are optional.
|
| > * PDF or online viewable links are preferred, where no
| authentication/log-in is required.
|
| (From the link on the BHUSA CFP page, which confusingly
| goes to the BH Asia doc:
| https://i.blackhat.com/Asia-25/BlackHat-Asia-2025-CFP-
| Prepar... )
| tptacek wrote:
| I think you're fine, most people don't take the paper bit
| seriously. It's not due until the end of July regardless
| (you don't need a paper to submit for the CFP).
| daeken wrote:
| The scramble to get your paper done in time is
| traditional! (And why my final paper for the Onity lock
| hack ended up with an entire section I decided was better
| off left unsaid; whoops.)
| mellosouls wrote:
| Has XBOW provided a link for this claim? I could only find:
|
| https://hackerone.com/xbow?type=user
|
| which shows a different picture. This may not invalidate their
| claim (best in the US), but a screenshot can be a bit
| cherry-picked.
| chc4 wrote:
| I'm generally pretty bearish on AI security research, and think
| most people in it don't know what they're talking about,
| but XBOW is frankly one of the few legitimately interesting and
| competent companies in the space, and their writeups and reports
| show good, well-thought-out results. Congrats!
| wslh wrote:
| I'm looking forward to the LLM's ELI5 explanation. If I
| understand correctly, XBOW is genuinely moving the needle and
| pushing the state of the art.
|
| Another great reading is [1](2024).
|
| [1] "LLM and Bug Finding: Insights from a $2M Winning Team in the
| White House's AIxCC":
| https://news.ycombinator.com/item?id=41269791
| hinterlands wrote:
| XBOW has really smart people working on it, so they're well
| aware of the usual 30-second critiques that come up in this
| thread. For example, they take specific steps to eliminate
| false positives.
|
| The #1 spot in the ranking is both more of a deal and less of a
| deal than it might appear. It's less of a deal in that HackerOne
| is an economic numbers game. There are countless programs you
| can sign up for, with varied difficulty levels and payouts. Most
| of them don't pay a whole lot and don't attract top talent in
| the industry; instead, they offer supplemental income to
| infosec-minded school-age kids in the developing world. So I
| wouldn't read this as "XBOW is the best bug hunter in the US".
| That's a bit of a marketing gimmick.
|
| But "best bug hunter" is also not a particularly meaningful
| objective. The problem is that there are a lot of low-hanging
| bugs that need squashing, and it's hard to allocate sufficient
| resources to that. Top infosec talent doesn't want to do it (and
| there's not enough of it). Consulting companies can do it, but
| they inevitably end up stretching themselves too thin, so the
| coverage ends up being hit-and-miss. There's a huge market for
| tools that can find easy bugs cheaply and without too many false
| positives.
|
| I personally don't doubt that LLMs and related techniques are
| well-tailored for this task, completely independent of whether
| they can outperform leading experts. But there are skeptics, so I
| think this is an important real-world result.
| absurdo wrote:
| > so they're well-aware of the usual 30-second critiques that
| come up in this thread.
|
| Succinct description of HN. It's a damn shame.
| normie3000 wrote:
| > Top infosec talent doesn't want to do it (and there's not
| enough of it).
|
| What is the top talent spending its time on?
| hinterlands wrote:
| Vulnerability researchers? For public projects, there's a
| strong preference for prestige stuff: ecosystem-wide
| vulnerabilities, new attack techniques, attacking cool new
| tech (e.g., self-driving cars).
|
| To pay bills: often working for tier A tech companies on
| intellectually-stimulating projects, such as novel
| mitigations, proprietary automation, etc. Or doing lucrative
| consulting / freelance work. Generally not triaging Nessus
| results 9-to-5.
| tptacek wrote:
| Specialized bug-hunting.
| martinald wrote:
| This does not surprise me. In a couple of 'legacy' open source
| projects I found DoS attacks within 10 minutes, with a working
| PoC. It crashed the server entirely. I suspect that with more
| prompting it could have found an RCE, but this was an idle
| shower thought I decided to try.
|
| While these projects are niche and not widely used, there are at
| least thousands of publicly accessible servers for each of them.
|
| I genuinely think this is one of the biggest near-term issues
| with AI. Even if we get great AI "defence" tooling, there are
| just so many servers and devices (IoT or otherwise) out there,
| most of which are not trivial to patch. While a few niche
| services getting pwned probably isn't a big deal, a million
| niche services all getting pwned in quick succession is likely
| to cause huge disruption. There is so much code out there that
| hasn't been remotely security checked.
|
| Maybe the end solution is some sort of LLM-based "WAF",
| deployed by ISPs, that inspects all traffic.
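| For what that might look like, a toy sketch (classify_request
| is a hypothetical model call; a real deployment would need far
| lower latency and cost than current LLMs offer):
|
|     from http.server import BaseHTTPRequestHandler, HTTPServer
|
|     def classify_request(raw: str) -> float:
|         """Hypothetical model call returning P(malicious)."""
|         raise NotImplementedError
|
|     class LLMWafHandler(BaseHTTPRequestHandler):
|         BLOCK_THRESHOLD = 0.9  # made up; tune vs. false positives
|
|         def do_GET(self):
|             raw = f"{self.command} {self.path}\n{self.headers}"
|             if classify_request(raw) > self.BLOCK_THRESHOLD:
|                 self.send_error(403, "Blocked by policy")
|                 return
|             # ...otherwise forward the request upstream (omitted)
|             self.send_response(204)
|             self.end_headers()
|
|     HTTPServer(("", 8080), LLMWafHandler).serve_forever()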
___________________________________________________________________
(page generated 2025-06-24 23:00 UTC)