[HN Gopher] New headless Chrome has been released and has a near...
___________________________________________________________________
New headless Chrome has been released and has a near-perfect
browser fingerprint
Author : avastel
Score : 377 points
Date : 2023-02-19 12:33 UTC (10 hours ago)
(HTM) web link (antoinevastel.com)
(TXT) w3m dump (antoinevastel.com)
| graderjs wrote:
| I built a remote browser based on headless Chrome^0 and this is
| going to make things way easier. It's also great to see Google
| supporting Chrome use cases beyond "consumer browsing", and
| perhaps that's in large part been pushed by the "grass roots
| popularity" of things like puppeteer and playwright.
|
| One thing I'm hoping for (but have heard it would require
| _extensive_ rejigging of almost absolutely everything) is
| Extensions support in this new headless.
|
| However, if I'm reading the winds, it seems as if things _might_
| be going there, because:
|
| - Tamper scripts now work on Firefox mobile
|
| - Non-webkit iOS browsers are in the works
|
| - It's technically possible to "shim" much of the
| chrome.extension APIs using RDP (the low-level protocol that pptr
| and its ilk are based on) which would lead essentially to a
| "parallel extensions runtime" and "alt-Webstore" with less
| restrictions, something which Google may not look merrily upon
|
| Anyway, back to "headless detection", for the remote isolated
| browser, I have been using an extensive bot detection evasion
| script that proxied many of the normal properties on navigator
| (like plugins, etc), and tested extensively against detectors
| like luca.gg/headless^1
|
| Interestingly one of the most effective ways to defeat "first
| wave" / non-sophisticated bots used to be simply throwing up a JS
| modal (alert, confirm, prompt) -- for the convenient way it kills
| the JS runtime until dismissed, and how you have to explicitly
| dismiss it.
|
| ^0 = https://github.com/crisdosyago/BrowserBox
|
| ^1 = https://luca.gg/headless/
| [deleted]
| natorion wrote:
| I am the PM working on Headless. Feel free to ask questions in
| this thread and I will try to answer them if I can.
|
| Edit: Please also note that we have not released New Headless
| yet. We "merely" landed the source code.
| LilyFrenchPants wrote:
| [flagged]
| natorion wrote:
| What rumors? Can you provide any links or context?
| rmorey wrote:
| what makes the new one "Native" ?
| natorion wrote:
| It's real Chromium, not emulating a Chromium browser. "Old"
| Headless was merely pretending to be a Chromium browser, the
| "New" Headless is a Chromium browser. "Old" Headless requires
| a parallel/duplicate implementation of features, which leads
| to subtle behavior differences or infeasibility to support
| certain features e.g. extensions proper.
| rmorey wrote:
| wow, i had no idea the old headless was a reimplementation.
| congrats on landing the new one
| mh- wrote:
| Does this mean we might see proper extension support in
| "New" Headless?
| mike_hearn wrote:
| Do you guys ever think about abusive automation at all, or do
| you just consider that other people's problem?
| lupire wrote:
| Abusive how? Headed chrome can be automated, as can wget.
|
| It's bizarre to ask a client side program to implement
| server-side controls for users you want to allow on your site but
| throttle.
| scotty79 wrote:
| You call it abuse. Other people might call it use.
| hackernewds wrote:
| You call it use. Other people might call it abuse.
| _moof wrote:
| Are we just misquoting the Eurythmics now?
| scotty79 wrote:
| That's my point exactly.
| mike_hearn wrote:
| I've not yet encountered anyone who doesn't consider spam
| to be a form of abuse.
| scotty79 wrote:
| So easily detectable headless chrome browser ends all
| spam? And if not, why do you mention spam? Why not rape
| which is another unambiguous example of abuse?
| parker_mountain wrote:
| For what it's worth, the large "players" already seem to have
| this capability. They've forced pretty much everyone to roll
| out captchas, proof of work interstitials, and behavior-based
| fingerprinting.
|
| While my immediate response was the same as yours, I think
| this actually won't really change much in the way of bad
| actors.
|
| It's unfortunate, but basic controls (such as throttling,
| etc) are pretty much a floor-required feature - one way to
| avoid this burden is to do things like use 3rd party idp (aka
| google login). I'm not happy with the state of things but I
| don't think headless will particularly contribute to a
| material increase in abuse cases.
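|
| As a rough illustration of the "basic controls (such as throttling)"
| mentioned above, here is a minimal sliding-window limiter sketch. The
| class name, the limits and the idea of keying on a client id are all
| assumptions made for the example, not anything from this thread.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowThrottle:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-client timestamps of recently accepted requests (hypothetical store).
        self._hits = defaultdict(deque)

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        """Return True if the request is accepted, False if it is throttled."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

| A server would call allow() with an IP address or session id before
| doing any real work. Passing `now` explicitly is only there to make
| the behaviour easy to check.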
| pdntspa wrote:
| The implications of your question are beyond dystopian
| DangitBobby wrote:
| Please elaborate.
| supriyo-biswas wrote:
| See my comment[1] on this very thread.
|
| [1] https://news.ycombinator.com/item?id=34858232
| pdntspa wrote:
| Because it suggests adding usage controls, possibly
| enforced via cloud connectivity, to add restrictions that
| will inevitably make legitimate usage more difficult,
| frustrating, and most importantly, subject to outside
| control. Extend this far enough and the world starts to
| look like Doctorow's "Unauthorized Bread".
|
| This is an awful world, one designed to reinforce class
| divide and protect the entrenched and the rich by
| deliberately handicapping easily-accessible tools,
| because of a few bad actors. It creates a world where the
| code for literally everything is the most hideously
| complex version of itself because it is riddled with
| constant checks, phone-homes, and arbitrary usage limits.
| It further pushes us towards a disempowering future where
| our computing is limited exclusively to appliance-like
| devices whose inner workings are controlled for it. It
| stands against the very principle of general-purpose
| computing.
| robertlagrant wrote:
| That's not beyond dystopian. It's just dystopian.
|
| And implications of a question aren't either. Just your
| imagined implications. Questions aren't bad.
| starik36 wrote:
| Any chance of a build for the Raspberry Pi?
| oh_sigh wrote:
| Is it too late to change the name from "new headless"? It won't
| be new forever, and then there will need to be a new new mode,
| or a differently named one that people think is older because
| it isn't the new mode.
| dylan604 wrote:
| No, obviously, the next version will be called Newer
| Headless. Then you get the More Newer or Even Newer release.
| Or my personal favorite NewV2. /s
|
| Using the word "new" in naming conventions is the most
| moronic and shortsighted way to name things in something that
| is quite obviously going to be changing in the somewhat near
| future.
| robertlagrant wrote:
| New College is doing fine even with its name. It's just a
| name. Doesn't really matter.
| dboreham wrote:
| Also New Forest.
| oh_sigh wrote:
| It reminds me of "Pont Neuf" ("new bridge" in French), which
| is the oldest bridge in Paris crossing the Seine.
| plugin-baby wrote:
| See also: report_final_draft(1).doc
| natorion wrote:
| You would be surprised how much we talked about that.
| New/old are just relevant for the transition period.
| Macha wrote:
| But then how would you have the pleasure of figuring out the
| sort order between New $Feature, Advanced $Feature, Revamped
| $Feature and Enhanced $Feature?
| Ono-Sendai wrote:
| Can this replace chromium embedded framework (CEF)?
| natorion wrote:
| I fail to see the connection. Can you elaborate?
| nobu-mori wrote:
| Now that headless mode is a "real" Chromium instance, is it
| possible to add extension support to Chrome running in headless
| mode?
| andrewstuart wrote:
| So this argument can be used these ways:
|
| --headless
|
| --headless=new
|
| --headless=chrome
|
| And each mean something different - but what?
|
| Not documented, very frustrating.
|
| Can you explain the difference between each of the above
| arguments?
| skybrian wrote:
| Can you talk about your team's motivations for improving
| headless mode? Any particular use cases in mind?
| natorion wrote:
| Here are two of them:
|
| - Test reproducibility
|
| - Automated configuration rollouts in enterprise environments
| ccooffee wrote:
| Improving test environments is a huge upside. I haven't
| worked on browser automation in nearly a decade, but
| finding ways to work around shortcomings in the headless
| environment used to burn a lot of time on that team. I know
| of many small teams which made deliberate decisions NOT to
| do any browser automation tests (e.g. Selenium) because
| some issues required testing hooks in production code.
| LinuxBender wrote:
| There are many comments about potential abuse. I would be
| curious to know if your team have ever challenged each other to
| look like a real person accessing a site and the other part of
| the team tries to detect and block them? If there is anyone
| that could do this it would be the creators of Headless.
|
| Why go through the exercise, one may ask? I believe it would be
| a critical thinking exercise to improve Headless even more
| while giving website maintainers a way to opt out of receiving
| traffic from it. If not your team, have you reached out to see
| if people from project zero would take on that challenge in
| their abundance of spare time? [1]
|
| [1] - https://googleprojectzero.blogspot.com/
| natorion wrote:
| We regularly get feature requests for Headless to provide a
| field or property that can be polled by JS frameworks to
| detect if Headless is active e.g. window.isBot.
|
| Well, Headless is open source, which means anybody could
| build a Headless version with such a property set to "I am a
| human, trust me!" and employ such a modified binary ... ;-)
| imglorp wrote:
| RFC for IPV4 evil bit.
|
| https://www.rfc-editor.org/rfc/rfc3514
| LinuxBender wrote:
| You jest, but I could actually see this becoming a thing.
| I envision a future dystopian internet where people first
| have to authenticate their network gear, PC's, laptops,
| cell phones, cars, trucks, e-bikes, toasters, coffee
| makers to a government contracted service. Once
| authenticated they utilize something similar to that RFC
| but probably instead a _nonce_ or jwt token tied to their
| device that gets embedded in the packet header somehow.
| Then sanctioning a continent, country, state, ISP, city,
| company, manufacturer, distributor or person would be
| simply disabling their _evil bits_ so to speak.
|
| The push for this is starting with adult content [1] but
| the goal posts could easily be mounted on a train car with
| a very long and smooth train track that only goes
| downhill.
|
| [1] - https://news.ycombinator.com/item?id=34726509
| LinuxBender wrote:
| Oh absolutely, relying on a header would be a placebo at
| best. I was thinking more along the line of having two
| teams, one that develops Headless and another team at
| Google that try to defeat it non stop. An official game of
| cat and mouse. Project: Tom and Jerry? I guess legal would
| never buy into that name.
|
| My own personal method for my silly hobby sites is just to
| put passwords on things with an auth prompt delay.
| dmix wrote:
| Why should Google redteam their headless browser though?
| As other comments point out there's plenty of ways for
| bot detectors to id bots even with a browser which
| mirrors a normal one:
| https://news.ycombinator.com/item?id=34858056
|
| Almost all of those things are outside of the scope
| of the browser itself. And anyone doing serious bot
| attacks already has scripts/forks that modify these
| signals. I don't see how the chrome team could do much to
| help stop that at that level.
| LinuxBender wrote:
| In theory their blue team could come up with even more
| advanced puzzles that bots trip over and then open source
| and document the bot puzzles. I don't know that they
| would, incentives _or lack thereof_ and all. If nothing
| else it might make their work day more fun.
|
| Or if I put my evil corp hat on, the incentive could be
| that they make puzzles that only Headless can get around
| and all other bots become trivial to block and obsolete
| by even the least knowledgeable hobbyist. Perhaps Google
| release Nginx, Apache HTTPD, Apache Traffic Server, Envoy
| and HAProxy modules that only Headless can get around and
| all other bots internet-wide are entirely silenced.
| Chrome becomes the one and only bot to rule them all.
| robertlagrant wrote:
| Why would they want to do that?
| LinuxBender wrote:
| Oh man, you're making me put that hat back on.
|
| I suppose that Google going through that exercise would
| mean that they get market dominance on bot gathering data
| and anyone not using Chrome Headless would be unable to
| obtain freebie data. This could enable future _features_
| whatever that may be. _readjusts hat_ One future feature
| could be auto-discovery of Google DNS and Google proxies
| in GCP so they can learn about new data sources through
| crowd-sourcing thus making their big-data sets more
| complete and their machine learning more powerful.
| Developers could block the proxies or compile them out
| but as we know most people are too lazy to do this and
| many won't care.
|
| Another advantage would be that eventually the only bots
| abusing Google would be bots using their code and they
| would know how to detect and deal with them as they would
| implement their own open source anti-bot modules in their
| web servers, load balancers, etc...
|
| There are more obscure ideas but I am doffing the hat
| before the hat-wraiths sense it.
| paulirish wrote:
| https://github.com/paulirish/headless-cat-n-mouse was this
| basic idea, but open sourced.
| BonoboIO wrote:
| At the end we come to a browser and we have to emulate a mouse
| that does all the clicking.
| PascLeRasc wrote:
| This is off topic but when did we get the ability to use spaces
| in URLs?
| syrrim wrote:
| In what sense? Spaces as percent-encoded (%20) values have been
| around ever since I've used the web. those spaces are
| occasionally displayed as spaces in the url bar, depending on
| the context.
| pixelesque wrote:
| Browsers have automatically done the "correct" thing
| (converting to "bot%20detection") under-the-hood for years in
| my experience. I remember MS FrontPage-made sites with spaces
| in the name and IE would work with them.
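|
| For reference, the conversion described above is ordinary percent-
| encoding. A small sketch using Python's standard library (the sample
| string is the one quoted in the comment):

```python
from urllib.parse import quote, unquote

typed = "bot detection"      # what a person types or sees in the URL bar
encoded = quote(typed)       # what is actually sent in the request path
decoded = unquote(encoded)   # what a browser may display again

print(encoded)  # bot%20detection
print(decoded)  # bot detection
```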
| eazyson wrote:
| Which chromium forks are there that do this?
| novaleaf wrote:
| I am using the new headless Chrome for my Browser-Automation SaaS
| (PhantomJsCloud.com) and it is working great.
|
| It fixes some nagging incompatibilities with certain websites. I
| don't bother with anti-bot mitigations, and I don't expect this
| to be useful in that regard. commercial Anti-Bot doesn't care
| about how much you spoof your browser fingerprint.
|
| feel free to AMA
| newhotelowner wrote:
| Can you share the code for how to launch a new headless chrome?
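|
| Not novaleaf's code, but a minimal sketch of one way to start it,
| assuming the `--headless=new` flag quoted elsewhere in this thread
| and Chrome's `--dump-dom` switch. The binary name "google-chrome" and
| both helper functions are assumptions; adjust them for your install.

```python
import shutil
import subprocess
from typing import List


def build_headless_command(binary: str, url: str, mode: str = "new") -> List[str]:
    """Return an argv list that starts Chrome in the requested headless mode."""
    # --headless=new selects the new implementation; --dump-dom prints the
    # serialized DOM to stdout and then exits.
    return [binary, f"--headless={mode}", "--dump-dom", url]


def dump_dom(url: str, binary: str = "google-chrome") -> str:
    """Run Chrome headlessly and return the page's DOM as text."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary!r} not found on PATH")
    result = subprocess.run(
        build_headless_command(binary, url),
        capture_output=True,
        text=True,
        timeout=60,
        check=True,
    )
    return result.stdout
```

| Puppeteer and Playwright users would normally pass the same flag
| through their launch options rather than shelling out like this.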
| jaimex2 wrote:
| No one stopped a Chromium fork from this earlier.
| thekingshorses wrote:
| I wish I could automate some of the banking tasks. I tried but
| couldn't automate Chase, Citi or CapitalOne.
|
| If anyone has a working script to login and perform a simple task
| on one of these sites, please share it.
| user3939382 wrote:
| Last time I was able to automate Chase by targeting their
| mobile site which, at the time anyway, had a dedicated URI.
| Mobile site was simple HTML and easy to scrape.
| mike_hearn wrote:
| The game continues. Back in 2010 when I was writing the first in-
| browser bot detection signals for Google (so BotGuard could spot
| embedded Internet Explorers) I wondered how long they might last.
| Surely at some point embedded browsers would become undetectable?
| It never happened - browsers are so complex that there will
| probably always be ways to detect when they're being automated.
|
| There are some less obvious aspects to this that matter a lot in
| practice:
|
| 1. You have to force the code to actually run inside a real
| browser in the first place, not simply inside a fast emulator
| that sends back a clean response. This is by itself a big part of
| the challenge.
|
| 2. Doing so is useful even if you miss some automated browsers,
| because adversaries are often CPU and RAM constrained in ways you
| may not expect.
|
| 3. You have to do something sensible if the User-Agent claims to
| be something obscure, old or alternatively, too new for you to
| have seen before.
|
| 4. The signals have to be well protected, otherwise bot authors
| will just read your JS to see what they have to patch next.
| Signal collection and obfuscation work best when the two are
| tightly integrated together.
|
| These days there are quite a few companies doing JS based bot
| detection but I noticed from write-ups by reverse engineers that
| they don't seem to be obfuscating what they're doing as well as
| they could. It's like they heard that a custom VM is a good form
| of obfuscation but missed some of the reasons why. I wrote a bit
| about why the pattern is actually useful a month ago when
| TikTok's bot detector was being blogged about:
|
| https://www.reddit.com/r/programming/comments/10755l2/revers...
|
| tl;dr you want to use a mesh oriented obfuscation and a custom VM
| makes that easier. It's a means, not an end.
|
| Ad: Occasionally I do private consulting on this topic, mostly
| for tech firms. Bot detectors tend to be either something home-
| grown by tech/social networking firms, or these days sold as a
| service by companies like DataDome, HUMAN etc. Companies that
| want to own their anti-abuse stack have to start from scratch
| every time, and often end up with something subpar because it's
| very difficult to hire for this set of skills. You often end up
| hiring people with a generic ML background but then they struggle
| to obtain good enough signals and the model produces noise. You
| do want some ML in the mix (or just statistics) to establish a
| base level of protection and to ensure that when bots are caught
| their resources are burned, but it's not enough by itself
| anymore. I offer training courses on how to construct high
| quality JS anti-bot systems and am thinking of maybe in future
| offering a reference codebase you can license and then fork. If
| anyone reading this is interested, drop me an email:
| mike@plan99.net
| narag wrote:
| What are bots used for? I can think of a few reasons, wrote a
| scraper/submitter myself in the 90's for a cooperative of
| subcontractors that was being forced to use an extremely
| sluggish web app by the big company that provided their gigs.
|
| But I guess there are all kind of purposes, some benign some
| nefarious, and that they somehow influence the bot operation
| and detection.
| 323 wrote:
| People are paying $500 for bots used to buy the latest
| Nike/Adidas/... limited edition sneakers. Or videocards a few
| years ago (for crypto mining).
|
| It's a whole industry.
|
| > _If we consider a user base of ~175 users, and a minimum
| bot price of 200 euros (175 users x 200 euros), then the bot
| developers made at least 35K euros (~$37K USD) in initial bot
| sales._
|
| https://datadome.co/threat-research/inside-sneaker-bot-
| busin...
| waynesonfire wrote:
| Artificial scarcity in sneakers is their design decision.
| These shenanigans should have zero impact on browser
| policy.
| 20after4 wrote:
| I thought about building something like that for
| photographers to get gigs from large real-estate photography
| contractors who sub-contract the work to independent
| photographers. Automated tools would benefit the
| photographers greatly. The benefit comes at the expense of
| those not using automated tools, so the morality of such a
| tool is at least somewhat questionable.
| shanebellone wrote:
| "The signals have to be well protected, otherwise bot authors
| will just read your JS to see what they have to patch next.
| Signal collection and obfuscation work best when the two are
| tightly integrated together."
|
| JS sounds like a bad match for this task. I perform similar
| checks from the backend with http headers and Python.
|
| Is there a compelling reason to stick with JS despite the added
| complexity of obfuscation?
|
| Edit: My use case is different than yours as it's part of a
| pid-free analytics application. However, bot detection is still
| an important component of that product.
| mh- wrote:
| If you're only relying on http headers, you're missing all
| but the most trivial of "bots". There are other things you
| could do with a backend-only approach but if your code
| doesn't run where the device connects to (e.g. you're behind
| a load balancer or other reverse proxy), those are largely
| unworkable.
| shanebellone wrote:
| "If you're only relying on http headers, you're missing all
| but the most trivial of bots"
|
| Very true. Capturing, processing, and storing analytics
| data long-term is expensive. If I eliminate even 50% of
| that noise, the savings will be worth it.
|
| I'm attempting to identify the bulk of bots with http
| headers and real-time session monitoring. I also have an
| unauthorized list (known bad actors) and an ignore list
| (search bots, etc.). It works pretty well but definitely
| doesn't begin to address the problem as a whole (from a
| security perspective).
|
| It's an interesting and complex topic.
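|
| A rough sketch of what header-based filtering with an ignore list and
| an unauthorized list might look like. Every list entry, label and
| rule below is a hypothetical placeholder, not shanebellone's actual
| implementation, and header names are assumed to be already
| normalized to the casing shown.

```python
from typing import Dict

# Hypothetical lists; a real deployment would maintain these separately.
IGNORE_UA_TOKENS = ("googlebot", "bingbot")        # search bots to skip
BLOCKED_IPS = {"203.0.113.7"}                      # known bad actors (documentation-range IP)
AUTOMATION_UA_TOKENS = ("headlesschrome", "python-requests", "curl")


def classify(headers: Dict[str, str], ip: str) -> str:
    """Label a request as 'blocked', 'ignored', 'bot', 'suspect' or 'human'."""
    ua = headers.get("User-Agent", "").lower()
    if ip in BLOCKED_IPS:
        return "blocked"
    if any(token in ua for token in IGNORE_UA_TOKENS):
        return "ignored"
    # A missing User-Agent or a self-declared automation tool is the easy case.
    if not ua or any(token in ua for token in AUTOMATION_UA_TOKENS):
        return "bot"
    # Real browsers almost always send Accept-Language; treat absence as a hint only.
    if "Accept-Language" not in headers:
        return "suspect"
    return "human"
```

| As mh- points out above, this only catches clients that do not
| bother to lie about their headers, so it works as a cheap first pass
| and nothing more.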
| ggambetta wrote:
| Heh, I had a feeling you'd show up here. Hi, Mike :)
| mike_hearn wrote:
| Long time no see mate :)
| 20after4 wrote:
| Re: your ad.
|
| This sounds like a solid product / startup idea to me. I worked
| on spambot detection in a previous job and it's not at all
| trivial to solve. Though we were specifically interested in
| detecting the abusive use of bots, not bots in general, so I
| focused simply on detecting unusual resource consumption rather
| than fingerprinting.
| mike_hearn wrote:
| There are startups doing this sort of thing already, the
| article is written by the head of research at one. But tech
| firms often like to have their own in-house stack with the
| source code.
| kbuck wrote:
| What do you mean by a "mesh-oriented obfuscation"? My best
| guess is: serving a different subset of the VM detection code
| to each client?
| ilyt wrote:
| Why is my first reaction to the last part "oh no!"? Seems
| something that would have more illegitimate/annoying use cases
| than good
| nine_k wrote:
| Can't you say the same about a real browser with Selenium
| driving it? It's been available for years; has it been hugely
| detrimental for something?
| jeroenhd wrote:
| It's not like spam farms can't use their own version of
| Chromium that already mimics a real browser. Relying on client
| side indicators for your bot detection will only catch the bots
| that don't care about being caught in the first place. Show an
| alert that says "welcome to my site!" for any browsers
| originating from a data center and you've probably filtered
| most of those out.
|
| I like automating menial tasks in shitty web UIs (i.e. clearing
| out a list of sessions/search history/ad providers that only
| allow removing a single entry at a time). Simply using Firefox
| also gets flagged by a lot of these shitty bot detection
| services. I've never seen them do any useful work.
|
| The only exception is maybe reCAPTCHA or Cloudflare's
| alternative; that seems to be quite good at catching actual
| bots, but I do hate most websites that use them because in
| Firefox you end up clicking on boats twenty times. They're also
| trivially bypassed by delegating your spamming to click farms,
| as 1000 minimum wage workers in a faraway country can be
| cheaper than paying for dev time to work around the minor
| nuisances of bot detection.
| jasmer wrote:
| We should assume anyone visiting a site without some kind of
| credentialed login is a 'bot'.
|
| Or for all intents and purposes 'noise' traffic.
|
| It'd be nice for the powers that be to develop an anonymous cookie
| standard to allow people to flag themselves as 'humans' without
| enabling the host to know anything about them.
|
| We are fighting wars over problems that we have created for
| ourselves.
| harrisonjackson wrote:
| We have a chatbot that can send users screenshots of their CMS
| views (kanban, calendar, tables, gallery, etc) from inside of
| Slack.
|
| The screenshotting uses puppeteer and chromium and a read-only
| session to impersonate the user and screenshot their dashboard.
|
| It uses the old version of chromium and there were many gotchas
| that required a lot of extra scaffolding to actually render ours
| and other websites like they would on my laptop. This will
| hopefully make it easier for us to maintain once implemented.
| eimrine wrote:
| > navigator.plugins.length = 0
|
| So, any website on the Internets can know how many plugins my
| browser has? Ridiculous!
| joshschreuder wrote:
| It would seem like no, in recent times at least. In recent
| browser versions (Chrome 94+, Firefox 99+, etc.) it's been
| changed to only report the default PDF plugins
|
| https://developer.mozilla.org/en-US/docs/Web/API/Navigator/p...
| transitivebs wrote:
| The cat & mouse game continues...
| natorion wrote:
| PM working on Headless here. Masking bots is not the reason why
| the new Headless mode was created. The goal is to provide a
| headless browser that can be used in web tests. The original
| Headless is essentially a separate browser implemented in
| parallel to "proper" Chromium. That results in all sorts of
| subtle reproducibility problems for developers using Headless
| for their tests.
| elbigbad wrote:
| PM working on Private Browsing mode here. Watching
| pornography is not the reason why the new Private Browsing
| mode was created. The goal is to provide a Private Mode that
| can be used for Christmas shopping. ;)
|
| In all seriousness, despite intentions, and I do love
| headless mode for actual integration tests with Webdriver,
| it's no exaggeration to say that it is likely the single
| greatest avenue for bots and spam enablement across the
| entire internet, and imo is probably net Bad.
| zarzavat wrote:
| If it weren't for bots there would be no search engines, no
| internet archive, no WWW. Bots, and the tools for making
| them, are essential to the functioning of the web.
| dmix wrote:
| A necessary evil for supporting an open and programmable
| internet (IMO).
| ufmace wrote:
| It seems more neutral to me. Yes there's a lot of spam and
| other types of malicious behavior, but I don't think it's
| good overall to try to eliminate web automation entirely to
| stop it.
| ilyt wrote:
| > PM working on Headless here. Masking bots is not the reason
| why the new Headless mode was created.
|
| Right. But it will be massively used just for that.
| literallyroy wrote:
| Yes, same as many technologies with legitimate uses. Tor is
| largely used for illegal activities, yet many would say the
| anonymity it provides for the general public is worth it
| being created (or the anonymity it provided for US
| intelligence).
| ilyt wrote:
| I'm not chastising anyone for building a piece of cool tech
| but that does seem like something like a holy grail for
| bots.
| richwater wrote:
| Nice to know you are the arbiter of what is and what
| isn't a "cool piece of tech"
| ilyt wrote:
| I don't know whether you're illiterate or just
| maliciously misinterpreted what I wrote.
| SeanAnderson wrote:
| This is such good news to hear. Browser test automation was a
| pretty sore spot. I'm excited for your work.
| danaris wrote:
| > Masking bots is not the reason why the new Headless mode
| was created.
|
| You might consider looking into some resources on Intent vs
| Impact (eg, [0]).
|
| IMNSHO, anyone working in tech has a responsibility to
| consider what their creations _can_ be used for, in addition
| to what they _intend_ them to be used for. There's just too
| much potential for scalability of nefarious behavior to do
| otherwise.
|
| [0] https://www.masterclass.com/articles/intent-vs-impact
| tssva wrote:
| Please reveal what you work on so I can publicly judge
| whether you have considered and properly chosen between
| intent vs impact or any other possible moral failings of
| your work as I see it.
| Mimmy wrote:
| I'm naive here but why would Chrome release a headless browser
| that makes it easier for bot developers to avoid detection?
| TAKEMYMONEY wrote:
| headless browsers are faster than the normal browsers (no
| GUI) so your tests run faster
| [deleted]
| ethbr0 wrote:
| Because none of the people complaining about headless bots
| (read probably: content and retail) are major stakeholders
| from Chrome's viewpoint.
| dmix wrote:
| This blog post is written that way because the guy works in
| the bot detection business so it's what he cares most about.
|
| But there are still plenty of legitimate use cases for
| wanting a headless browser that perfectly replicates a normal
| browser environment. The obvious ones are automated frontend
| testing tools like https://playwright.dev/
| hoistbypetard wrote:
| Exactly. And as the blog post mentioned, people who have a
| strong need to block bots have tools other than browser
| fingerprinting at their disposal. Quoth the post:
|
| > It's important to leverage other signals such as:
|
| >
|
| > * Behavior (client-side and server-side)
|
| > * Different kinds of reputations (IP, sessions, user)
|
| > * Proxy detection, in particular, residential proxy
| detection
|
| > * Contextual information: time of the day, country, etc
|
| > * TLS fingerprinting.
|
| Having a headless browser that behaves exactly like a
| normal one is tremendously useful for making things. And
| people who really *need* to block bots also need to contend
| with "mechanical turk" style attackers anyway. These
| techniques are also very useful against that approach,
| which still may be cheaper than making an undetectable bot
| even with a near-perfect Chrome fingerprint available
| headless.
| mercurialuser wrote:
| We use a headless browser to load an internal webpage (with
| content that may be updated several times per day) and
| generate a pdf on-demand.
| dataviz1000 wrote:
| As a bot developer, without taking legal steps (I do not
| break the law) there is no stopping me regardless.
| hashseed wrote:
| Chrome sets navigator.webdriver to true when controlled by
| automation.
|
| Until now, bots could simply use headful mode to achieve the
| same effect that is now made available through the new headless
| implementation.
| [deleted]
| [deleted]
| TAKEMYMONEY wrote:
| > _the new headless Chrome can still be detected using JS browser
| fingerprinting techniques [...] however, the task has become more
| challenging [...] I'm not going to share any new detection
| signals_
|
| Any guesses?
| botflyguy wrote:
| In the bot detection methods I've seen so far on this, a large
| part of it is timing analyses where there is a significant
| difference between headed and headless, e.g. graphical
| operations, audio processing.
| zelphirkalt wrote:
| That could be circumvented rather easily I guess, by using a
| non-headless (head-having? head-full? headed?) browser
| instead. And perhaps adding some random human-seeming delay
| in interactions.
| zahrc wrote:
| Headed browser.
|
| And maybe, but that will make end users suffer more (as
| always), as more false-positives will be caught.
| bornfreddy wrote:
| That, or making sure that mouse really moved somewhere (in a
| sensible way) before the click occurred.
| mwill wrote:
| This would have false positives for some accessibility
| software, I believe
| vntok wrote:
| True, that's why you don't want to block the pageload on
| this signal alone, just use it to trigger a captcha.
| hoistbypetard wrote:
| It's pretty awful to make people who need accessibility
| software go through more captchas. Those are an
| accessibility nightmare.
| ryandrake wrote:
| Or even non-disabled people who typically browse using
| the keyboard only. Please stop sending users who you find
| inconvenient to captchas!
| shp0ngle wrote:
| The best way to catch a robot is just to slap a captcha there.
| Everything else is kind of useless and not effective.
| luckylion wrote:
| Captchas also tell apart the average human visitor from the
| very committed human visitor that really, really, really needs
| to do whatever they can do on your website.
| lupire wrote:
| They are also very good at distinguishing paying Google users
| who get the fast-pass to Google captchas.
| luckylion wrote:
| Is that a thing? What services do you need to buy to bypass
| recaptcha?
| bornfreddy wrote:
| Ha ha, good point. When presented with captcha I often decide
| I don't care that much and just close the page.
| Symbiote wrote:
| I do the same, but sometimes I wish I could give better
| feedback.
|
| "Dear British Airways, I booked with SAS instead because
| you assumed a Linux user with Firefox was a bot."
|
| (Or maybe it was the other way round, I forgot.)
| sfe22 wrote:
| That means way more captchas after this release, yay
| phiresky wrote:
| Getting captchas solved reliably via a service costs around $1
| per 1000 captchas so captchas are kinda useless as well if
| there's a tiny monetary incentive to get to whatever is behind
| the captcha.
| hackernewds wrote:
| How is that accomplished? Real humans?
| Alifatisk wrote:
| Mhm
|
| https://www.deathbycaptcha.com/
| from wrote:
| https://2captcha.com
|
| Yes
| nine_k wrote:
| Real humans, in places where $10 / day is reasonable money.
| dewey wrote:
| Depends on the captcha but there's many popular services
| that you can plug into your code through APIs for bypassing
| captchas (https://www.2captcha.com,
| https://anti-captcha.com). I think the hardest one is probably
| the invisible reCaptcha Enterprise.
| gsich wrote:
| Why even bother?
| TAKEMYMONEY wrote:
| > _However, with recent progress in automatic and audio
| recognition, [detecting bots with captchas] has evolved_
|
| ...and that's from _3 years_ ago
|
| https://antoinevastel.com/javascript/2020/02/09/detecting-we...
| XzAeRosho wrote:
| How do captchas work for blind people behind screen readers? I
| usually use a lot of keyboard strokes which seems to trigger a
| lot of captcha systems
|
| So far, the play-audio options are kind of weird, especially if
| you're hard of hearing.
| dewey wrote:
| They sometimes have an audio version, unfortunately at the
| same time this one is used to bypass the captcha through
| audio recognition software.
| kerpotgh wrote:
| [flagged]
| Symbiote wrote:
| Educate yourself before writing such selfish nonsense.
|
| https://www.sense.org.uk/about-us/statistics/deafblindness-s...
| yreg wrote:
| This is a terrible notion.
|
| > Accessibility is for everyone, including you, if you live
| long enough and the alternative is worse. So your choice is
| death or you are going to use accessibility features.
|
| - John Siracusa
|
| Also, making services accessible is not only the obviously
| right thing to do, but also the law here in EU.
|
| https://en.wikipedia.org/wiki/European_Accessibility_Act
| krono wrote:
| Offering up some of our strength, ability, and comfort to
| help others who might be less fortunate or whose qualities
| lie elsewhere is what makes us human, and probably played a
| large part in getting us where we are today.
|
| You might be part of a very minuscule group yourself, if
| this is really what you believe.
|
| Our digital world will never be perfect, but allowing for
| everyone to at least be able to access and benefit from it
| is very much something we can and should do.
| Timon3 wrote:
| This is a terrible take. The more technology is integrated
| into society, the more we need to offer different avenues
| to access it. Otherwise we'll be excluding the differently-
| abled from many parts of society, and at some point we
| really should be able to put that behind us...
| [deleted]
| yreg wrote:
| All captchas are soon going to be difficult to solve for humans
| and easy to solve for bots. Many already are. They also have
| terrible accessibility.
| chuckwolfe wrote:
| I tried with Akamai and it still didn't work. Still need the
| stealth plugin and some additional tweaks to bypass it.
| chirau wrote:
| How do I set the _new_ part of the headless flag in Python?
|
| The article mentions that to use this you need to specify the
| _--headless=new_ flag.
|
| I know that to set the headless flag I can just use this code:
|
|     from selenium.webdriver.chrome.options import Options
|     options = Options()
|     options.headless = True
|
| But how would I specify the new part of the flag/option?
| mixedCase wrote:
| [flagged]
| pRusya wrote:
| There's a mention of this in the recent Selenium blog post
| https://www.selenium.dev/blog/2023/headless-is-going-away/#a...
|
| Basically omit options.headless and use
| options.add_argument("--headless=new") instead.
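|
| A minimal sketch of that suggestion, assuming Selenium 4 and a
| locally installed Chrome. The helper names chrome_arguments and
| make_driver are made up for illustration; only
| options.add_argument and the --headless=new switch come from the
| posts above.

```python
from typing import List


def chrome_arguments(use_new_headless: bool = True) -> List[str]:
    """Return the command-line switches to pass to Chrome."""
    # "--headless=new" is the switch the article says selects the new
    # headless mode; plain "--headless" is the older spelling of the flag.
    return ["--headless=new"] if use_new_headless else ["--headless"]


def make_driver(use_new_headless: bool = True):
    """Start Chrome through Selenium (needs selenium and a local Chrome)."""
    # Imported lazily so chrome_arguments() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in chrome_arguments(use_new_headless):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

| Calling make_driver() and then driver.get(url) should load a page
| in the new headless mode; call driver.quit() when done.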
| londons_explore wrote:
| If you add DRM video playback to the fingerprint, it is pretty
| much impossible to fake...
|
| Either they have a real TPM with a real nvidia graphics card able
| to decrypt content with a real serial number... Or they don't...
|
| If one graphics card or TPM serial number starts acting bot-like,
| you can ban just that one.
| beagle3 wrote:
| Also shutting out a lot of older and weird devices (internet
| fridges, dumb smart tvs, and more, many Linux and bsd users)
| who can't play DRM.
|
| Some sites won't care, but for some this will be too high a
| price for avoiding headless bots.
| azalemeth wrote:
| I browse with DRM disabled. Every time it gives me a
| notification about it, I view it as a "hah, fingerprinting
| avoided!" signal.
|
| Sites that use it get my anti-traffic. I don't buy, support, or
| condone DRM'd media and I actively disable EME on every browser
| I come across...
| nine_k wrote:
| Good for you.
|
| There are sites that commercially distribute DRMed video
| content; say, Netflix. They have a large audience, and they
| care, whether me and you like it or not.
| lupire wrote:
| How much of that audience is watching on a device without a
| video card? Almost none.
| nine_k wrote:
| AFAICT, the server can avoid serving the DRMed content
| until the browser proves it has a legitimate DRM-
| respecting playback capability, which is designed to be
| hard to feign. That is, unless something like [1] is
| correctly implemented in the headless mode, DRM content
| won't be available anyway.
|
| Am I missing anything?
|
| [1]: https://developer.mozilla.org/en-US/docs/Web/API/Navigator/r...
| onion2k wrote:
| What use case is there for accessing DRM video content
| using a headless browser?
| sebzim4500 wrote:
| Automated downloading of the content, I assume.
| azalemeth wrote:
| I've never used Netflix (or other streaming sites like
| them) _because_ of the DRM. Youtube manages to prove that a
| streaming model can be very, very profitable without it at
| all, as does BBC iPlayer.
| andrewmackrodt wrote:
| Using Netflix as the example, Widevine L1 has very limited
| support on the desktop, i.e. Microsoft Edge on Windows and
| Safari on macOS.
|
| All other configurations use L3 which is a shared key, e.g.
| provided by ChromeCDM as it runs entirely on the CPU -
| which is why Netflix content also works under Linux, albeit
| L3 is limited to 720p (or 1080p with browser extensions).
|
| Given Chrome's massive browser market share, I'm not sure
| whether enabling DRM adds anything meaningful to the
| fingerprint - i.e. I don't think it's possible to revoke an
| L3 key without pushing out a new version of the CDM to all
| users of that browser, as has happened once before with
| Chrome.
|
| FWIW I've tested Widevine L3 decryption works using a
| "headless" docker container running Chrome. The only caveat
| to add is that Chrome must not be started with --headless,
| but you don't need a real GPU either, Xvfb works just fine.
| 2h wrote:
| > I don't buy, support, or condone DRM'd media
|
| this is good, but it would also be helpful if you supported
| the anti DRM movement. Some people have developed ways to get
| around certain DRM such as Widevine, from dumping your own
| CDM to Widevine proxy. Just ignoring the problem is not going
| to make it go away. Over the last two years DRM use for
| streaming content has increased significantly. If you want to
| really help, I would look into contributing code to these
| projects, or donations.
| lupire wrote:
| [flagged]
| theyeenzbeanz wrote:
| We don't want to deal with having to be forced into
| having specific hardware, operating systems, and browsers
| to watch content we paid for. I've had perfectly good
| monitors from before HDCP was a thing, and these
| sites gimp the quality or outright refuse to play media
| because the monitor didn't have some bogus technology.
| jimmydorry wrote:
| DRM has a huge impact on what I consume. For example only
| being able to watch Netflix at 720p due to running a *nix
| distro.
| rwmj wrote:
| Even as someone who isn't in the slightest interested in
| unauthorised copying of content, watching videos on
| anything which isn't VLC on my laptop is such a PITA that
| I never do it.
| sarnowski wrote:
| TPMs do not reveal a unique serial number or similar identifier
| by design for privacy reasons.
|
| A TPM can attest that some measurements were done with it and
| it can attest that it comes from vendor X. You can block an
| entire vendor if they don't behave but not individual TPMs via
| remote attestation.
|
| You can use a scheme in which you can set up an "identity" on
| first use and then on next use authenticate the same identity.
| But that identity is kinda per use case.
| melvyn2 wrote:
| I was under the impression that the EK could be used to
| identify individual TPMs- why can't it?
| jefftk wrote:
| _> If one graphics card or TPM serial number starts acting bot-
| like, you can ban just that one._
|
| I don't think you can get the serial number, though?
|
| (And if there was an API for this it wouldn't be a passive one,
| which makes it inapplicable for fingerprinting)
| 323 wrote:
| Can you report back the TPM serial number to the webserver?
|
| If so, why isn't this used as an immutable ever-cookie that
| can't be deleted?
| t0mas88 wrote:
| You can't, the parent comment has combined a few real world
| possible things into an impossible combination.
| RobotToaster wrote:
| Why couldn't they just use a software TPM?
| redox99 wrote:
| I don't believe DRM fingerprinting is used in the wild. Firefox
| shows when DRM is being used (like Netflix) and I've never seen
| it used outside that.
| ffpip wrote:
| Reddit's website uses DRM for fingerprinting -
| https://iter.ca/post/reddit-whiteops/
| redox99 wrote:
| Maybe they changed their mind on that, because it does not
| show me any DRM usage as of now.
| xnx wrote:
| How does this work? Wouldn't a lot of real user-agents not have
| this capability and therefore not be able to be fingerprinted
| and banned in this way?
| nullifidian wrote:
| Are there non-headless browsers modified specifically to have
| extremely generic fingerprints? Hiding OS, GPU, fonts everything.
| worksonmine wrote:
| Not a browser but Arkenfox[1] hardens standard firefox. But
| it's not for everyone and using something this specific can be
| a problem in itself.
|
| [1]: https://github.com/arkenfox/user.js/
| matterhorn2000 wrote:
| Firefox (and probably others) has fingerprint protection.
| https://support.mozilla.org/en-US/kb/firefox-protection-agai...
| nullifidian wrote:
| Any chromium based forks?
| Eisenstein wrote:
| Brave.
| krmbzds wrote:
| +1. Also make sure to disable all cryptocurrency and
| "web3" related plugins for a pleasant experience.
| jeroenhd wrote:
| Going by the amount of upset advertising/cyberstalking
| companies that Brave is indistinguishable from Chrome, I
| think this may be the answer.
|
| I don't like the way they pretend(ed) to send funds to
| websites using their cryptocurrency services, though.
| Good software, sketchy company.
| zelphirkalt wrote:
| Tor browser (based on Firefox) seems to fit that bill.
| cratermoon wrote:
| > As you can imagine, given my position at DataDome (a bot
| detection company), I'm not going to share any new detection
| signals as I used to do
|
| Here comes the sales pitch....
| supriyo-biswas wrote:
| I'm assuming the next step will be to bring Cloudflare's pet
| project of TPM attestation into Chrome, otherwise known as
| PATs[1]. And just like that, not only would headless be defeated,
| but all of you using rooted devices and small time browsers would
| be left high and dry.
|
| It's "Right to read"[2] all over again.
|
| [1] https://www.ietf.org/archive/id/draft-private-access-tokens-...
|
| [2] https://www.gnu.org/philosophy/right-to-read.en.html
| judge2020 wrote:
| What is the solution to automation then? What do I do when
| someone hits my content-rich Wordpress blog with a scraper that
| hits 100 pages a second to download my content, and my database
| falls over leading to real, legitimate users being unable to
| use my site? What if it's not a legitimate scraper but someone
| with hundreds of proxies uses them to DDOS my site for days?
| Should I sacrifice my uptime to protect the freedom of those
| unwilling to attest that they're running on real hardware?
| aumerle wrote:
| rate limit. Or paywall.
| simonw wrote:
| Put your WordPress blog behind a caching proxy with a 5s TTL
| - that way any amount of traffic to a URL will produce at
| most one hit every 5 seconds to your backend.
|
| I've used this trick to survive surprise spikes of traffic in
| multiple projects for years.
|
| Doesn't help for applications where your backend needs to be
| involved in serving every request, but WordPress blogs
| serving static content are a great example of something where
| that technique DOES work.
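|
| The short-TTL idea above can be sketched in a few lines. This is
| an in-process illustration only, not a real caching proxy: the
| TTLCache class, its fetch callback and the injectable clock are
| all hypothetical names, and it ignores concurrency, cache
| eviction and HTTP cache headers.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serve a cached body per URL and refetch at most once per TTL."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # stand-in for the slow backend call
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the TTL can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._entries.get(url)
        # Fresh entry: answer from memory without touching the backend.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Missing or expired: hit the backend once and remember the result.
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body
```

| With ttl_seconds=5.0, any number of get() calls for one URL
| within a 5 second window reaches the backend once.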
| [deleted]
| supriyo-biswas wrote:
| Proof-of-work schemes such as Hashcash[1] and simple
| ratelimiting algorithms can act as deterrents to spamming and
| scraping attacks.
|
| There are other kinds of non-invasive bot management you can
| do as well, however, due to various reasons I'm not in a
| position to talk about it. A few other methods are mentioned
| at the end of the post being discussed[2].
|
| [1] https://en.wikipedia.org/wiki/Hashcash
|
| [2] https://antoinevastel.com/bot%20detection/2023/02/19/new-hea...
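|
| A rough sketch of a Hashcash-style challenge, for illustration.
| It is not the real Hashcash stamp format (which uses SHA-1 and a
| structured header); SHA-256, the "challenge:nonce" layout and the
| function names are assumptions made here. The client burns CPU in
| solve(), while the server checks the answer with one hash in
| verify().

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Cheap server-side check: one hash per submitted nonce."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Expensive client-side search: about 2**difficulty_bits hashes."""
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce
```

| Raising difficulty_bits by one roughly doubles the expected work
| for the client and leaves the server's check unchanged.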
| geokon wrote:
| Wasn't mining in the browser basically shut down by every
| major browser?
|
| It was done super fast... one can't help but think that
| Google pulled all the levers they had at Apple/Mozilla to
| make sure the first viable alternative to advertising was
| killed before it was born. But I think as a side effect it
| might make PoW sort of impossible?
|
| I don't really know how the mining "fingerprinting" works
| exactly, so I would be curious to know if I'm wrong.
| duskwuff wrote:
| What killed "mining in the browser", more than anything
| else, was:
|
| 1) It was almost exclusively used for malicious purposes.
| Very few legitimate web sites used cryptominers, and it
| was never considered a viable substitute for display
| advertising; it was primarily deployed on hacked web
| sites. Browser vendors were relatively slow to react;
| many of the first movers were actually
| antivirus/antimalware vendors adding blocks on
| cryptominer scripts and domains.
|
| 2) The most popular cryptominer scripts, like Coinhive,
| all mined the Monero coin. (Most other cryptocurrencies
| were impractical to mine without hardware acceleration.)
| Monero prices were at an all-time high at the time; when
| Monero prices crashed in late 2018, the revenue from
| running cryptominer scripts dropped dramatically, making
| these scripts much less profitable to run. (This is
| ultimately what led Coinhive to shut down.)
| jefftk wrote:
| Proof of work isn't very practical here, because
| computation is a lot cheaper in datacenters than on phones.
| supriyo-biswas wrote:
| The trick is to prevent the offloading of the proof-of-
| work challenge to another device, as suggested in the
| Picasso paper[1].
|
| [1] https://storage.googleapis.com/pub-tools-public-publication-...
| jefftk wrote:
| Neat! This does seem like it should work!
|
| Semantic quibble: it's less "proof of work" and more
| "proof of hardware+work". Or, as they call it, hardware-
| bound proof of work. The reason you can't offload the
| challenge to a more powerful device is that they rely on
| identifying stable differences for each device class that
| ultimately trace down to the hardware they're running on.
| mindslight wrote:
| From reading the abstract, isn't this just exploiting the
| same class of security vulnerabilities that the OP is
| lamenting are being fixed?
| jefftk wrote:
| Not sure. Maybe not, if it's about device-specific
| information instead of headed-vs-headless distinctions?
| schlauerfox wrote:
| Can privacy be preserved with zero knowledge proofs? I
| don't like the idea of universal fingerprinted devices in
| an already heavily authoritarian world.
| camgunz wrote:
| The method to stop a (D)DoS is the same as it always was:
| caching and rate limiting.
|
| Re: content scraping -- I was an indie web dev of a sort for
| a while and people always ask this question, and the answer
| is it's impossible to stop. Not even Facebook or big content
| sites like CNet or The Verge can stop it. At the bottom of
| it, you can just access the site in a browser and save the
| source. Content scraping is a rephrasing of "viewing content
| even just once". Stopping it is antithetical to the web and
| technologically infeasible.
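|
| The rate limiting mentioned at the top of this comment is often
| done with a token bucket. A minimal per-client sketch follows;
| the TokenBucketLimiter name, the client_id key (an IP address,
| say) and the injectable clock are assumptions for illustration,
| and a real deployment would more likely do this in the reverse
| proxy.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Allow short bursts but cap each client's sustained request rate."""

    def __init__(self, rate_per_second: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_second
        self._burst = float(burst)
        self._clock = clock
        # client_id -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(client_id, (self._burst, now))
        # Refill in proportion to elapsed time, never above the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        if tokens >= 1.0:
            self._buckets[client_id] = (tokens - 1.0, now)
            return True
        self._buckets[client_id] = (tokens, now)
        return False
```

| A scraper hitting 100 pages a second against a limiter built
| with rate_per_second=2 and burst=5 would get its first 5
| requests through and about 2 per second after that.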
| robomc wrote:
| it's probably actually cheaper to pay people piece rates to
| do it for you in a browser than to pay a developer to write
| and maintain a scraping script anyway, so if the latter
| became genuinely impossible moving to the former isn't a
| big deal.
| RobotToaster wrote:
| PoW captcha like MCaptcha. (It's technically not a captcha,
| for the pedantic)
| inetknght wrote:
| > _What do I do when someone hits my content-rich Wordpress
| blog with a scraper that hits 100 pages a second to download
| my content, and my database falls over_
|
| It's a blog. Blogs are not complex. Why is your blog's
| database so awfully designed that 100 pages a second causes
| it to fall over?
|
| > _leading to real, legitimate users being unable to use my
| site?_
|
| You assume that a scraper is not a legitimate user. I argue
| otherwise. If you don't want a scraper to use your site then
| put your site behind a paywall.
|
| > _What if it's not a legitimate scraper but someone with
| hundreds of proxies uses them to DDOS my site for days?_
|
| If it's a network bandwidth problem, then a reverse proxy
| (eg, CDN) solves that.
|
| > _Should I sacrifice my uptime to protect the freedom of
| those unwilling to attest that they're running on real
| hardware?_
|
| All software runs on real hardware. What is your exact
| question?
|
| I am accessing this site in a virtual machine. I could be
| doing it with a headless browser. Why does that matter at
| all?
| Cthulhu_ wrote:
| Make it someone else's problem; put a caching CDN in front of
| it, like Cloudflare, who have experience with these problems
| (like intentional or accidental DDOS).
| supriyo-biswas wrote:
| I understand and agree with the suggestion of putting a
| CDN, but it's somewhat ironic to suggest the use of
| Cloudflare when that very same company is advocating for
| the DRM-for-webpages scheme.
| sethhochberg wrote:
| Is it not fair to assume that Cloudflare, as a company
| that has made a name for itself selling various DDoS
| protection services, realizes it's in an arms race with
| the old-school way of handling these problems and is
| pursuing more advanced solutions before the current
| techniques are entirely useless?
|
| It would be easy to point to the irony of saying "instead
| of supporting Cloudflare's proposals for PATs, use their
| CDN product for brute force protection" but on the other
| hand, they employ a lot of experts in this space and
| might see the writing on the wall in an increasingly
| adversarial public internet.
| [deleted]
| formerly_proven wrote:
| If you can only use PATs on headlessless machines, then they're
| actually headpats. Everybody loves headpats. I don't see a
| problem.
| qbasic_forever wrote:
| What stops someone from making a fake TPM that speaks the
| appropriate protocol and just instantly signs off on every
| request? AFAIK there isn't some grand/central list of trusted
| TPM modules. Anyone can implement one as a Linux driver:
| https://www.kernel.org/doc/html/latest/security/tpm/tpm_ftpm...
|
| A fake TPM would be useless for security but just fine for
| fooling websites that there is a real human at the computer.
| melvyn2 wrote:
| From wikipedia:
|
| > Computer programs can use a TPM to authenticate hardware
| devices, since each TPM chip has a unique and secret
| Endorsement Key (EK) burned in as it is produced.
|
| That EK is signed by the TPM manufacturer, and so it's likely
| they'll only trust the keys of physical TPM manufacturers.
| Good luck forging that in software.
| userbinator wrote:
| I wonder if we'll get a cat-and-mouse game with
| miscellaneous TPM manufacturers "accidentally" leaking
| their keys, getting blacklisted, creating new ones, etc.
| I'd like to think that there's at least a nontrivial amount
| of the population wanting to subvert the authoritarian
| corporatocracy and with the skills to do so.
| qbasic_forever wrote:
| It's going to be an extremely janky or very private website
| if they only allow you to use it when you have 1 of like a
| dozen supported and approved hardware TPMs to view it.
| syrrim wrote:
| The latest windows version requires a hardware tpm on a
| device in order to be installed. Every hardware vendor
| has therefore included a tpm on all their new machines.
| This was already standard on apple devices, and many
| android devices have one as well.
| qbasic_forever wrote:
| Sure but someone who wants to build a web scraper won't
| care, they could use their own homebrew TPM that does a
| no-op and claims a user pressed a button or was present
| when they actually were not there.
|
| I doubt websites will go to the trouble to keep a list of
| approved TPMs. It's the SSL root certs nightmare all over
| again and even worse. No one is going to want to deal
| with managing a whole new giant list of devices, having
| fire drill updates to revoke compromised ones, etc.
___________________________________________________________________
(page generated 2023-02-19 23:01 UTC)