hngopher.com

       [HN Gopher] The /unblock API from Browserless: dodging bot detec...
       ___________________________________________________________________
        
       The /unblock API from Browserless: dodging bot detection as a
       service
        
       Author : keepamovin
       Score  : 99 points
       Date   : 2024-02-27 17:30 UTC (5 hours ago)
        
 (HTM) web link (www.browserless.io)
 (TXT) w3m dump (www.browserless.io)
        
       | adsthrowaway99 wrote:
       | For a while I worked in ads, and the specific team I was in put
       | us in a position of being on both sides of this. We were tasked
       | with monitoring advertisements for fraud, which meant we both had
       | to catch bot traffic, but also we had to check ads for
       | scams/phishing/etc. So on the one hand we were trying to plug
       | holes in our own bot detection for botnets trying to drive up
       | impression numbers to get website operators paid more, but on the
       | other hand we were trying to bypass bot detecting on the part of
       | phishing page authors (who would do things like display a totally
       | plausible e-commerce page if they think the ad viewer is a bot,
       | but a bank of america phishing page if they think it's a real
       | user).
        
         | Avamander wrote:
         | > other hand we were trying to bypass bot detecting on the part
         | of phishing page authors
         | 
         | An approach I recently saw gated the phishing page behind a
         | Google Account login page. A user who was logged in wouldn't
         | even notice the brief redirect. Scanners would just get stuck.
         | I hope they've patched it by now though...
        
           | yjftsjthsd-h wrote:
           | A coincidence of how I work with my browser profiles is that
           | the browser (profile) that does most of my web browsing isn't
           | logged into Google. It's nice to see that this has extra
           | benefits:) Now if you'll excuse me, I need to go move my
           | entire computing environment into a VM so that viruses will
           | think they're being analyzed and refuse to run... (Kidding
           | but also not)
        
         | spondylosaurus wrote:
         | Ha, I think we may have worked at the same company (or at least
         | at competing companies)....
        
       | nhggfu wrote:
       | someone got tweak we can use for puppeteer to achieve this
       | effect? [kinda LOL @ asking people who SCRAPE DATA for free to
       | pay for an API, no?]
        
         | ohthatsnotright wrote:
         | You can use Firefox nightly and create a webextension using
         | their internal APIs. Gives you access to injecting non-
         | synthetic or faked events (they look indistinguishable since
         | they are emitted in the same code path that the real input
         | device uses) and scraping the DOM with the added advantage of
         | not exposing the automation flags that Firefox has when you use
         | Puppeteer and the like in headless, automated mode.
         | 
         | edit: Firefox Nightly because it gives you much deeper access
         | to Firefox internals.
        
           | nhggfu wrote:
           | interesting. would love to see some examples / code
        
             | ohthatsnotright wrote:
             | I can't give examples (I did the work for a company so I
             | don't own it) but starting at [0] [1] should get you to the
             | internals. It definitely carries some risk of Firefox
             | Nightly just randomly breaking when it updates but this
             | solution got us beyond quite a number of bot detectors so
             | if you're willing to risk Firefox Nightly it's possible. We
             | didn't need nearly the flexibility something like Puppeteer
             | has since we were injecting rrweb into the target page to
             | record the DOM mutations and sending those to other
             | clients.
             | 
             | That will give you a good start on where to even begin
             | looking. web-ext is insanely handy for launching Firefox
             | with a WebExtension already installed (but I did a lot of
             | other work around this to productize the offering).
             | 
             | [0] https://firefox-source-docs.mozilla.org/toolkit/components/e...
             | [1] https://firefox-source-docs.mozilla.org/overview/gecko.html
        
               | generalizations wrote:
               | What are some sites that have good bot detection? I'd be
               | interested in just seeing how my scraper holds up against
               | them.
        
         | michaelt wrote:
         | There's actually an (admittedly small) data scraping industry.
         | 
         | Let's say you are Wal-Mart, and you'd like to know which of the
         | products you sell are available cheaper at Target. Or which
         | neighbourhoods they deliver to that you don't, or which stores
         | have longer opening hours than yours, or whatever.
         | 
         | You can't legally exchange data with Target directly, that
         | would risk making you an illegal price-fixing cartel. You _can_
         | legally visit their website, but you don't feel like matching
         | up 10,000+ products.
         | 
         | Instead, you call up a business intelligence firm who has
         | already scraped your site and theirs and matched the products
         | up. They'll send you a CSV, for a price.
        
       | mmcclure wrote:
       | I can't help but feel a little conflicted about things like this
       | in the current climate. On one hand, as a developer trying to
       | just get some updated stuff into a spreadsheet, this looks
       | extremely useful.
       | 
       | On the other hand, there are a lot of people right now that want
       | to keep stuff accessible to humans and not have it scraped for
       | models. I know the lid's completely off the box on that front so
       | it's probably useless to fight it, but seeing _products_
       | explicitly designed to circumvent bot prevention feels kinda bad.
        
         | imathrowaway wrote:
         | Agreed, we've seen a lot of governing bodies _now_ start
         | scraping to fight scam sites and "fake" sites, so there's a lot
         | of legitimate use-cases popping up. It's not necessarily just
         | about gathering e-commerce data or backfilling lack of an API
         | anymore.
         | 
         | Source: Founder@browserless.io
        
           | Matheus28 wrote:
           | So how will you ensure your service is only used for "good"?
        
             | 67j67j67n wrote:
             | no services remain "good" so why ask the question. it gets
             | ignored or answered but then a decade or two down the line
             | the philosophy gets scrapped, a story as old as time.
        
               | chankstein38 wrote:
               | I still don't understand why Google publicly removed "Don't
               | be evil" from their company. It's one thing to not follow
               | it but to actively remove it feels weird. Like they
               | wanted to say "HEY Y'ALL GET READY FOR EVIL!"
        
               | ziddoap wrote:
               | You can go to
               | https://abc.xyz/investor/google-code-of-conduct/ and ctrl+f
               | "evil", and you'll see it's still there.
        
               | qingcharles wrote:
               | Can confirm this Google document contains the word
               | "evil."
        
           | scarface_74 wrote:
           | I fail to see any legitimate use case for getting around
           | protections that web developers put in to help prevent
           | this.
        
           | mmcclure wrote:
           | I know this is slightly off topic, but I have to ask: Why not
           | create a new account rather than use the same self-described
           | throwaway account you used to ask about raising prices (for
           | browserless?)[1]? For what it's worth, I initially assumed
           | you likely weren't legitimate and didn't engage, so if you
           | want to keep contributing to threads as yourself you might
           | want to consider using a new/different username.
           | 
           | [1] https://news.ycombinator.com/item?id=21271224
        
       | pkiv wrote:
       | In the end, the best way to avoid being blocked is to be a good
       | actor. All of these hacks won't stop someone who's determined to
       | prevent access (e.g. LinkedIn).
       | 
       | That's actually one of the reasons why I started
       | https://browserbase.com/. Maintaining headless browser
       | infrastructure can be such a pain. I've spent a lot of time
       | managing headless chrome fleets at scale, so happy to answer any
       | questions.
        
         | Etheryte wrote:
         | If I understand correctly, a lot of the issues you can run into
         | with regards to blocking come from the fact that you're using a
         | headless browser. Past a certain point, wouldn't it be less
         | work to use a regular browser and drive with Selenium or
         | similar solutions? Or does that not address the kind of
         | problems you're facing?
        
           | djbusby wrote:
           | I created a dedicated chrome profile (--user-data-dir) signed
           | in to a few sites and then drive it, with visible window from
           | scripts.
           | 
           | It does all my crawling. It goes very slowly, and it has never
           | triggered the bot detectors.
        
           | pkiv wrote:
           | The newest version of headless chrome actually runs the same
           | code as a "regular browser":
           | https://developer.chrome.com/docs/chromium/new-headless
        
           | tzs wrote:
           | I used to semi-automate access to some sites by using
           | Selenium with a non-headless browser. These were sites where
           | there were just one or two pages where I wanted some
           | automation to fill out a form or scrape some data, and they
           | frequently made changes to the home page that made it hard to
           | automate navigating from the home page to the pages I wanted
           | to automate.
           | 
           | The idea was to have a script use Selenium to launch
           | non-headless Chrome and then wait:
           | 
           |     from selenium.webdriver import Chrome
           | 
           |     driver = Chrome()
           |     driver.get("https://example.org")
           |     input("Press enter when ready")
           | 
           | I could then manually deal with logging in, answering any
           | CAPTCHA that came up, and navigate to the page I wanted to
           | run my automation. Then I could press "enter" in my terminal
           | and my script would continue.
           | 
           | That used to work fine, but then on sites using Cloudflare's
           | CAPTCHA it stopped working. Solving the CAPTCHA would just
           | result in another CAPTCHA.
           | 
           | I tried an alternative Selenium Chrome driver that was
           | supposed to be more stealthy, and tried setting various flags
           | that were supposed to make it so JavaScript could not tell
           | that Selenium was there, and those worked for a while, but
           | then they stopped working.
           | 
           | The results were similar using Selenium with Firefox.
           | 
           | I also tried Puppeteer, with Chromium and Firefox, and they
           | too could not get past the CAPTCHA loops.
           | 
           | I then tried Playwright. With Chromium and Webkit that got
           | the CAPTCHA loops. With Firefox it actually worked. I didn't
           | even see the CAPTCHA. The non-interactive check for not being
           | a bot passed.
           | 
           | Still, the whole approach seems fragile. I don't know if
           | Firefox/Playwright working was due to some fundamental
           | difference between Firefox and the others or just Cloudflare
           | having not yet gotten around to dealing with it.
        
         | dns_snek wrote:
         | Are there any stories you're willing to share, any tough nuts
         | you've had to crack to improve some aspect of operations,
         | whether it be reliability, performance, bot detection evasion,
         | or something else completely?
         | 
         | I've only dealt with scraping on a small scale and I quickly
         | realized that running "browsers as a service" is a pain in the
         | ass, they're not exactly lightweight, they like to get "stuck",
         | balloon in memory or some such.
         | 
         | I imagine your business will be quite successful if reliability
         | is good and the price is right!
        
           | pkiv wrote:
           | I gave a lightning talk on headless chrome here that is worth
           | checking out!
           | 
           | https://www.youtube.com/watch?v=vs-qzlW9M50&t=726s
        
       | Avamander wrote:
       | This sounds a lot like Abuse as a Service.
       | 
       | Trying to defend against malicious bots is tedious at best,
       | impossible at the worst. I don't really see how lowering the bar
       | would be a net positive. People will start using methods that
       | will cause more collateral damage and just reduce user freedoms.
       | 
       | My guess is that this only increases the push towards attestation
       | and attestation-like approaches. Login walls and PAT/Privacy Pass
       | are just the start.
        
         | Spivak wrote:
         | Damned if you do, damned if you don't. The only way to mitigate
         | the attestation future will be legislation so you might as well
         | do it while it lasts.
         | 
         | If you can't bypass bot detection now because in response
         | they'll make the bot detection harder then there's really no
         | winning is there?
        
           | Avamander wrote:
           | Even if you or I might find attestation against user freedoms
           | nobody is going to legislate attestation away. The
           | alternatives to the vast majority are way worse.
        
         | realusername wrote:
         | Device attestation doesn't solve the bot problem either as seen
         | on Android.
         | 
         | It just annoys browsers and OS with a smaller marketshare to
         | the point that I'm wondering if it's even legal with antitrust
         | legislation.
        
           | jeroenhd wrote:
           | It will, but only on recent, locked-down Android
           | devices. The first step in device attestation will probably
           | allow the Android click farms to continue, but once it's in
           | place, restrictions will just become stricter over time.
        
             | realusername wrote:
             | The clickfarms nowadays already use tons of unmodified
             | devices built on racks so I really don't see how it's going
             | to make a dent.
             | 
             | Devices are way too cheap for an attestation system to work
             | to counter bots
        
               | Avamander wrote:
               | Not only click farms, these devices are also used for SMS
               | toll fraud and other similar attacks. SafetyNet's
               | attestation only helps to the extent that you can ban a
               | specific model/maker.
        
               | realusername wrote:
               | Yes that's basically it, you can ban a specific device
               | but that's like removing water from the ocean with a
               | bucket.
               | 
               | Betting on device attestation really means betting that
               | computing devices will become more and more expensive in
               | the future since the device cost is the only blocker
               | created by the attestation.
               | 
               | And if there's one thing I'm expecting, it's that that's not
               | going to happen; device prices will continue to decline.
        
           | Etheryte wrote:
           | It could, but not in a way any one of us would want it. This
           | specific attested device isn't marked as legit by any of the
           | big players? Into the blocked list you go. This is very
           | similar to what we see today with email providers for
           | example. If you start running a new server, you start with a
           | negative reputation and have to climb out, you don't start
           | from neutral.
        
             | judge2020 wrote:
             | > If you start running a new server, you start with a
             | negative reputation and have to climb out, you don't start
             | from neutral.
             | 
             | Not really. If you start a new server from an IP range with
             | known bad actors, sure. Many have tried and failed to run
             | their own mail servers from Digitalocean, vultr, or even
             | more dedicated-esque server hosts and pretty quickly see
             | how much of a hassle doing so is.
             | 
             | But if you buy a v4 /24 from some reputable or old company
             | and get it assigned to your own AS, you won't have negative
             | reputation and will be fairly successful as long as you
             | have spf/dkim/dmarc set up properly.
        
               | dns_snek wrote:
               | > But if you buy a v4 /24 from some reputable or old
               | company and get it assigned to your own AS
               | 
               | Doesn't that have an extremely high barrier to entry? A
               | legal entity, significant ongoing fees, and some
               | infrastructure?
               | 
               | I'm under the impression that this is entirely out of
               | reach for anyone just trying to run an email server at
               | home for personal use. I would love to be wrong!
        
               | Avamander wrote:
               | Bulk sending (including transactional) indeed is out of
               | reach in that sense, but it's still not too difficult to
               | pull off on a smaller scale.
        
               | throwaway81523 wrote:
               | > if you buy a v4 /24 from some reputable or old company
               | and get it assigned to your own AS, you won't have
               | negative reputation
               | 
               | until you give yourself negative reputation by running
               | abuse bots from that range.
        
         | jeroenhd wrote:
         | Services like these are why massive blocks of IP addresses end
         | up firewalled off, and why Cloudflare/Google CAPTCHAs are
         | absolutely everywhere.
         | 
         | Normally, you can just block non-consumer ISPs, but this site
         | offers "residential proxy" services (basically a botnet-as-a-
         | service), which means that now consumer IP ranges need to be
         | selectively blocked as well.
         | 
         | I think PAT/Privacy Pass will solve this problem as far as
         | normal users are concerned ("normal" meaning "running Windows,
         | macOS, iOS, or Android, on devices with hardware attestation
         | capabilities"), but soon enough we'll arrive in an age where
         | you can only visit so many web pages before you've exceeded
         | your daily internet allowance.
        
         | CatWChainsaw wrote:
         | Maybe getting rid of Abuse As A Service is a positive use case
         | for Autonomous Killer War Drones as a Service!
        
           | vetrom wrote:
           | The expression is a bit different but this is literally a
           | (minor) piece of worldbuilding in Daniel Suarez's 'Daemon'
           | (2006).
        
         | buildfocus wrote:
         | Attestation creates all sorts of challenges absolutely, but it
         | actually doesn't really help here - there's plenty of services
         | doing this on real devices, and no real challenges to doing so
         | at the prices people will pay. The overhead is the one-off cost
         | of the cheapest legitimate device that will attest, split
         | between the many users to get near 100% utilisation over a
         | large period. Once you look at cheap androids & PCs, this gets
         | very cheap indeed.
         | 
         | The only solution is actual paid services, or real-user
         | verification (and deduplication, which means little privacy,
         | which means legal problems in much of the world) for free
         | accounts.
         | 
         | If you publish something on the internet for free, you _must_
         | accept that people can read it automatically. You can make it
         | difficult, make it a bit more expensive, but at the end of the
         | day paid data means paywalls.
        
       | unnouinceput wrote:
       | The never-ending game of walls and ladders
        
       | jastingo wrote:
       | What are the legitimate (i.e. legal) use cases for a product such
       | as this?
       | 
       | I agree with another comment that called this "Abuse as a
       | Service". It seems to me this product's design goal is nothing
       | more than to circumvent measures site owners take to prevent
       | abuse of their site and run a sustainable business.
        
         | BoorishBears wrote:
         | I used their previously available bot detection defeat to add
         | an import feature to my website: Users could link to their
         | creation on another site and my site would scrape the publicly
         | available content so they wouldn't need to re-enter all their
         | data
         | 
         | I've used their product many times actually, and I'm shocked on
         | Hacker News of all places no one's thinking of anything besides
         | abuse. How often is it useful to get information from a webpage
         | and apply it in a new context? Then think of how often said
         | webpage is behind a Cloudflare bot detector.
        
           | daemonw wrote:
           | If it's the user's data, then under GDPR the other site is
           | obligated to provide a way for them to download/transfer it,
           | specifically with this use case in mind.
           | 
           | They are completely in the right to block you though, you're
           | not the owner of that data, you might be breaking their TOS.
        
             | yjftsjthsd-h wrote:
             | > If it's the user's data, then under GDPR the other site
             | is obligated to provide a way for them to download/transfer
             | it, specifically with this use case in mind.
             | 
             | In Europe, if the company is actually following the law, in
             | theory yes.
             | 
             | > They are completely in the right to block you though,
             | you're not the owner of that data, you might be breaking
             | their TOS.
             | 
             | IANAL, but AIUI that's definitely not true in the United
             | States and I suspect similar ideas hold elsewhere:
             | https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
        
               | Klonoar wrote:
               | There are a litany of posts on this very site that detail
               | why HiQ vs LinkedIn is more nuanced than you're making it
               | out to be. HiQ didn't ultimately have the slam dunk win
               | that people think they did.
        
             | BoorishBears wrote:
             | This is a non sequitur to my comment:
             | 
             | - GDPR doesn't require it be a convenient export. Users
             | want to paste a link on my site, click a button, and have
             | it magically appear. Not fill out a form, dump their entire
             | account and sift through that.
             | 
             | - I never opined on the validity of blocking bots
             | 
             | - I never opined on if it's breaking their TOS
             | 
             | Abuse implies a harmfulness. Giving users a quick import
             | option from already public data isn't harmful.
        
         | causalmodels wrote:
         | Bots acting on the behalf of users should not be blocked but we
         | have spent several decades treating bots (except for the
         | googlebot) as bad.
         | 
         | Like if I want to programmatically unsubscribe from a
         | subscription, why should I have to do it myself?
        
           | theamk wrote:
           | That's a bad example, "programmatically unsubscribing" means
           | giving spammers information that this address is alive. A
           | much better solution is to report the unwanted email as SPAM,
           | so the sender's reputation takes a hit.
           | 
           | (and for that 1% of the cases where the address is not a
           | spammer and user knows it, they can just hit "unsubscribe"
           | manually)
        
         | PathfinderBot wrote:
         | What about something like Nitter? Archiving? Adversarial
         | bridging between different platforms? Automation?
         | 
         | How will well-behaved scrapers undermine the sustainability of
         | a business? I guess adblocking is one, but we can already do
         | that with uBlock and that's legal. Or adversarial bridging, but
         | that only serves to boost competition.
         | 
         | In other words, the question is flipped; why would well-behaved
         | (i.e. non-DDoSing) scrapers be illegal?
        
           | judge2020 wrote:
           | Legality isn't the question here. If you want to speak to the
           | legality, circumventing a robots.txt that explicitly names
           | your bot's user-agent with 'Disallow: /' is unauthorized
           | access (I imagine it's more nuanced for 'User-agent: *'). No
           | website is required to allow anyone to visit and can
           | discriminate against any client or software any way they
           | want.
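
For readers who want to check what a given robots.txt says about a particular bot, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the bot name `ExampleBot`, and the helper `is_allowed` are all hypothetical, chosen only for illustration. It shows what the file declares, not whether ignoring it is lawful.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named bot entirely and
# keeps every other client out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In real use the parser would typically be pointed at the site's own file with `set_url()` and `read()` rather than an inline string.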
        
             | yjftsjthsd-h wrote:
             | > Legality isn't the question here.
             | 
             | The question was literally,
             | 
             | > What are the legitimate (i.e. legal) use cases for a
             | product such as this?
        
         | Szpadel wrote:
         | I use a similar self-hosted solution for things like detecting
         | when kids' lessons are available on a local portal. But to be
         | fair, the cheapest option here ($200) isn't usable for
         | non-business usage.
         | 
         | PS: context on why I need automation for such a thing: those
         | lessons are really popular and are announced at unpredictable
         | times, and another spot might open up when someone resigns.
        
         | tzs wrote:
         | I've got a couple of things I've used browser automation tools
         | for:
         | 
         | * I want to automate (or at least semi-automate) downloading
         | bank statements. I've got ~14 accounts (checking, savings,
         | credit card, IRA, investment, HSA) across 7 financial
         | institutions.
         | 
         | It's tedious to go download statements from all of them
         | manually.
         | 
         | * I want to save stories from FanFiction.net (FFN) for offline
         | reading. FFN's terms allow automation as long as it doesn't
         | operate faster than a human [1].
         | 
         | [1] From their TOS:
         | 
         | > You agree not to use or launch any automated system,
         | including without limitation, "robots," "spiders," or "offline
         | readers," that accesses the Service in a manner that sends more
         | request messages to the FanFiction.Net servers in a given
         | period of time than a human can reasonably produce in the same
         | period by using a conventional on-line web browser.
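
One way to respect a "no faster than a human" clause like the one quoted above is to enforce a minimum gap between requests. This is a minimal Python sketch under some assumptions: the class name `MinIntervalThrottle` is hypothetical, and the gap is left to the caller because the quoted TOS gives no number. The clock and sleep functions are injectable so the pacing logic can be exercised without real waiting.

```python
import time


class MinIntervalThrottle:
    """Keep successive wait() calls at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self):
        """Sleep if the previous call was too recent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A scraper would call `throttle.wait()` immediately before each page fetch, for example with `MinIntervalThrottle(5.0)` if five seconds per page is judged to be a human-like pace.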
        
           | iamacyborg wrote:
           | > I want to automate (or at least semi-automate) downloading
           | bank statements. I've got ~14 accounts (checking, savings,
           | credit card, IRA, investment, HSA) across 7 financial
           | institutions.
           | 
           | Could you not shoot an email to those institutions asking for
           | a copy of the documents?
        
         | heipei wrote:
         | Scanning for malicious and phishing websites. These types of
         | sites are just enjoying the ease of free services like
         | Cloudflare to block automated analysis tools and tailor their
         | phishing campaigns to very specific geographical locations and
         | user groups.
        
         | codedokode wrote:
         | I think they should introduce request rate limits per
         | IP/domain, for example max 1 parallel request. In this case
         | there will be no significant load, but the data can be scraped.
         | 
         | Scraping is important for example, to monitor competitors'
         | prices to see the opportunity to raise your own prices.
         | 
         | And let's not forget that Google does a lot more scraping than
         | anyone else and has ridiculous profits from it.
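
The "max 1 parallel request" per domain idea suggested above can be sketched on the client side as a limiter keyed by hostname. This is only an illustration under stated assumptions: the class `PerDomainLimiter` and its `slot` method are hypothetical names, it limits per target domain only (not per source IP), and it uses threads from the Python standard library.

```python
import threading
from contextlib import contextmanager
from urllib.parse import urlsplit


class PerDomainLimiter:
    """Allow at most max_parallel in-flight requests per target domain."""

    def __init__(self, max_parallel=1):
        self._max_parallel = max_parallel
        self._guard = threading.Lock()  # protects the _slots dict
        self._slots = {}                # hostname -> semaphore

    def _semaphore_for(self, url):
        host = urlsplit(url).hostname or ""
        with self._guard:
            if host not in self._slots:
                self._slots[host] = threading.BoundedSemaphore(self._max_parallel)
            return self._slots[host]

    @contextmanager
    def slot(self, url, timeout=None):
        """Hold one slot for url's domain; raise TimeoutError if none frees up."""
        sem = self._semaphore_for(url)
        if not sem.acquire(timeout=timeout):
            raise TimeoutError(f"no free slot for {url}")
        try:
            yield
        finally:
            sem.release()
```

Each worker thread would wrap its fetch in `with limiter.slot(url):`, so requests to different domains proceed in parallel while requests to the same domain queue up.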
        
         | acaloiar wrote:
         | I'm not a customer, but I have a use case that in my opinion
         | should be legal.
         | 
         | For years I've used my own terminal UI player (di-tui) for
         | di.fm. At some point in the not-so-recent past, di.fm added
         | Cloudflare's WAF, which prevents me from using one of my app's
         | features: managing channel favorites within the app.
         | 
         | To be clear, I'm a paying di.fm customer, and my app only works
         | for paying customers. But now my preferred method of listening
         | to di.fm is slightly hamstrung because Cloudflare's WAF sits
         | between me and a little string token available to every browser
         | that accesses di.fm (even non-paying customers).
        
         | jonatron wrote:
         | You sell jastingo(tm) brand widgets, but you notice fakes are
         | being sold on eBay, Amazon, AliExpress. You set up a scraper to
         | search for jastingo widgets every day on every marketplace
         | site, but you get blocked. So now you need an unblocker to
         | enforce your copyright/trademark/patent.
        
           | iamacyborg wrote:
           | Why does it need to be that complicated? If marketplaces are
           | selling fakes, get your lawyer to send them a letter.
        
             | jonatron wrote:
             | What if you have 10 brands, with 10 products each, and
             | there's 10 marketplaces.
        
         | qingcharles wrote:
         | I'm genuinely scraping a certain social network that doesn't
         | have an accessible API to do what I need. My user is logged in
         | and I just automate the logged-in browser to go to the pages
         | and get the data I need into a console so I can get the data I
         | require.
         | 
         | If there was an accessible API to do what I need, I wouldn't do
         | this because scraping sucks. I have to write 100 JavaScript
         | edge cases to handle all the times the host's servers fail in
         | very weird ways. Plus, walking DOMs on these shitty sites with
         | 10,000 nested divs is not fun. GPT helps with this.
         | 
         | It's net-positive for the host though, as I upload a lot of
         | valuable content that their users genuinely like, but it sucks
         | that I have to be sneaky to get the data I need.
        
       | LewisJEllis wrote:
       | I used to work for a bot mitigation vendor 8-10 years ago,
       | researching / implementing signals for this cat and mouse game.
       | 
       | This will get you past some very mundane bot detections, but
       | really this is like, the very first baby step of a long rabbit
       | hole.
       | 
       | The people who are taking this game seriously are 5-10 years
       | ahead of this step. Good luck ¯\_(ツ)_/¯
        
         | danpalmer wrote:
         | Yeah that's my reading. No way this is passing Akamai bot
         | detection.
         | 
         | There are lots of signals like timings, user tapping and
         | scrolling behaviour, signed session cookies that represent
         | browsing flows which may be legitimate or not. And that's all
         | assuming you're on a good looking IP. To do this you need a
         | large supply of residential IPs which then leads to the dodgy
         | underworld of botnets.
         | 
         | I'd be surprised if this works for anything but the most basic
         | bot protection, this is an advanced space.
         | 
         | If it does work for those cases, they should be either keeping
         | it quiet and making bank, or boasting about having a secret
         | sauce, not basic stuff like this.
         | 
         | Edit: for apps, Akamai provides an SDK that uses things like
         | your motion data to create a signature that suggests that
         | you're a real user. This signature is either injected into API
         | requests or into a webview session. I'm sure it's crackable if
         | you dedicate significant reverse engineering resources to it,
         | but then you've got to crack every version, crack every other
         | implementation from other companies, etc. Non-starter.
        
           | PathfinderBot wrote:
           | I'm just an outsider, but I wonder if these sort of bot-
           | blocker-bypass services can be done by employing people to go
           | to those pages manually.
        
           | dns_snek wrote:
           | When it comes to "motion data", are you referring to the
           | client's movement around the website, i.e. which URLs they
           | access and in which order? Their taps, clicks, scroll events,
           | and other inputs?
        
             | spondylosaurus wrote:
             | Motion data (especially in the context of an SDK, which I
             | assume means they're talking about in-app environments
             | rather than browser environments) usually refers to
             | gyroscope readouts on a phone or tablet.
             | 
             | A device that stays rock-steady throughout an entire
             | browsing session isn't necessarily suspicious on its own--
             | for example, you could have your phone laid flat on the
             | table while you browse with your pointer finger--but it
             | can be a useful tell in combination with other suspicious
             | factors.
        
               | dns_snek wrote:
               | That was my first thought as well but I couldn't believe
               | it. Doesn't access to gyroscope data require some sort of
               | permission prompt in the browser?
        
               | chankstein38 wrote:
               | Asking out of almost total ignorance of this field, what
               | prevents someone from running a script that sets their
               | agent to a phone browser and then sending fake gyro data?
               | Surely there's a way to emulate enough to make it look
               | like a phone that's being held by someone, right? We can
               | do realistic camera shake in blender to the point where
               | something looks like it's being held by a person, why
               | couldn't we fake minute movements like the device is
               | held?
               | 
               | Why do we even need an actual device? We can emulate if
               | we even need to and set our headers to look like we're
               | coming from a device browser.
        
             | Klonoar wrote:
             | It's likely referring to gyroscope, intended to guard
             | against racks of devices sitting somewhere that don't
             | physically move.
             | 
             | Humans move their devices even when typing.
        
         | imathrowaway wrote:
         | It's a never ending battle. Lots of tools aren't as
         | sophisticated as they claim to be, and the current mechanisms
         | inject a lot of "other stuff" that can be easily found. We're
         | trying to do this in a more novel way that's faster, less prone
         | to needing frequent updates, and is more akin to how actual
         | users interact with browsers. Definitely work to be done, but
         | it's exciting to see, and I appreciate the good luck!
         | 
         | Source: Founder@browserless.io
        
       | NorwegianDude wrote:
       | I've been fighting tailor made bots for over 15 years, and I
       | can't say I'm a fan of people trying to circumvent it. In some
       | cases it might not matter, but in others it can actually ruin a
       | lot of things.
       | 
       | Let's say I have a few 100 Gbps connections that are mostly idle,
       | is it fine if I direct them at browserless? No? Exactly, that
       | kind of traffic is not wanted.
        
         | codedokode wrote:
         | Browserless could implement limits on their side, for example,
         | allow only 1 parallel request to IP/domain. This way there will
         | be no significant load, but the data will be scraped.
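
A minimal sketch of the per-domain limit suggested above, assuming an asyncio-based client. `PerHostLimiter`, `fake_fetch`, and the example URLs are hypothetical names for illustration, not part of Browserless or any real API; the limiter only caps in-flight requests per hostname and does nothing about request rate over time.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class PerHostLimiter:
    """Allow at most `limit` in-flight requests per hostname (hypothetical helper)."""

    def __init__(self, limit=1):
        self._limit = limit
        self._semaphores = {}

    def _semaphore_for(self, url):
        # One semaphore per hostname, created lazily on first use.
        host = urlsplit(url).hostname or ""
        if host not in self._semaphores:
            self._semaphores[host] = asyncio.Semaphore(self._limit)
        return self._semaphores[host]

    async def run(self, url, fetch):
        # Wait until a slot for this host is free, then call the fetcher.
        async with self._semaphore_for(url):
            return await fetch(url)


async def demo():
    limiter = PerHostLimiter(limit=1)
    in_flight = defaultdict(int)
    peak = defaultdict(int)

    async def fake_fetch(url):
        # Stand-in for a real HTTP call; records peak concurrency per host.
        host = urlsplit(url).hostname
        in_flight[host] += 1
        peak[host] = max(peak[host], in_flight[host])
        await asyncio.sleep(0.01)
        in_flight[host] -= 1
        return url

    urls = ["https://a.example/1", "https://a.example/2", "https://b.example/1"]
    results = await asyncio.gather(*(limiter.run(u, fake_fetch) for u in urls))
    return results, dict(peak)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Requests to different hosts still run in parallel; only requests to the same host queue behind each other.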
        
           | thrwwycbr wrote:
           | As if IPs matter in the golden age of IoT botnets and
           | residential app malware.
           | 
           | You can't even block the amount of subnets that's coming for
           | you in a DDoS attack. Thinking a human is able to keep up
           | with something like this is pretty shortsighted and naive.
           | The differentiation of network protocols and relay attacks
           | alone is way too slow to be mitigated in most systems.
        
         | iamacyborg wrote:
         | It's all fun and games until you get a taste of your own
         | medicine.
        
       | reactordev wrote:
       | I'm sorry if this offends you, but it's companies like you that I
       | despise. WAFs are a good thing. Your bot is not authorized. We'll
       | be blocking all IP CIDRs from you, your referrers, and your
       | service. Thanks.
        
         | yard2010 wrote:
         | No. Long live the piracy. Good bye and thanks for the fish.
        
       | internetter wrote:
       | 90% of the problems solved with puppeteer could be instead solved
       | by replaying a few API requests. Cheaper, less wasteful, faster,
       | and in some cases harder to detect (a WAF is rarely deployed on
       | APIs because there are far fewer heuristics).
        
       | iambateman wrote:
       | A lot of negativity about scraped data, in comments...but both
       | OpenAI and Google rely on scraping as their foundation
       | for...everything.
       | 
       | If smaller indie hackers are going to build useful competitors,
       | they're going to need to scrape.
       | 
       | I understand that a lot of scraping is from spammers, but not
       | all.
        
         | bdcravens wrote:
         | > If smaller indie hackers are going to build useful
         | competitors
         | 
         | Any useful competitor will have to scale well beyond the point
         | of indie hacker, where someone is (eventually) getting rich.
         | That doesn't make for a great argument for bypassing consent.
        
           | troyvit wrote:
           | Depends on how you describe "useful." Kagi is a profitable
           | search company that satisfies its customers while competing
           | directly with Google. They aren't useful in the sense that
           | they're taking marketshare from Google, but they are useful
           | to their user base. I hope they continue being able to scrape
           | sites with the same freedom as other search engines.
           | 
           | Edit: As mmcclure points out Kagi actually doesn't index
           | much. So that's a bad example that might invalidate my point.
        
             | mmcclure wrote:
             | I'm a happy Kagi user, but they might be somewhat a bad
             | example. I suspect their actual indexing work is pretty
             | limited, and they only seem to claim that it's for indexing
             | "niche" stuff.
             | 
             | > Our data includes anonymized API calls to traditional
             | search indexes like Google, Yandex, Mojeek and Brave,
             | specialized search engines like Marginalia, and sources of
             | vertical information like Apple, Wikipedia, Open Meteo,
             | Yelp, TripAdvisor and other APIs. Typically every search
             | query on Kagi will call a number of different sources at
             | the same time, all with the purpose of bringing the best
             | possible search results to the user.
             | 
             | source: https://help.kagi.com/kagi/search-details/search-
             | sources.htm...
        
               | troyvit wrote:
               | Thanks I updated my comment. Also, damn that's cool.
        
         | mmcclure wrote:
         | Scraping itself isn't a problem. Scraping content where people
         | are actively trying to stop it from being scraped is where
         | things get questionable.
         | 
         | > both OpenAI and Google rely on scraping as their foundation
         | for...everything.
         | 
         | OpenAI and Google also (claim to) respect the most basic bot
         | deterrent, robots.txt, with their crawlers. If someone doesn't
         | want their content scraped to be included by search indexes or
         | LLMs/models, then that certainly includes "smaller indie
         | hackers."
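
For readers unfamiliar with how a crawler honours robots.txt, here is a small sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `allowed` helper, and the `indie-scraper` user agent are made up for illustration; `GPTBot` is the token OpenAI documents for its crawler.

```python
from urllib.robotparser import RobotFileParser

# Example policy (an assumption for this sketch): block one named crawler
# entirely, and keep every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed(ROBOTS_TXT, "GPTBot", "https://example.com/post"))
    print(allowed(ROBOTS_TXT, "indie-scraper", "https://example.com/post"))
    print(allowed(ROBOTS_TXT, "indie-scraper", "https://example.com/private/x"))
```

A well-behaved scraper would run a check like this before each fetch, whatever its size; robots.txt is advisory, so compliance is entirely up to the client.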
        
       | silentsanctuary wrote:
       | I want to try Browserless but it doesn't appear to be possible to
       | do so without signing up for a free trial of a $200/month plan -
       | have they especially hidden this or am I just blind?
        
         | pkiv wrote:
         | They recently increased their pricing quite a bit. We're
         | looking to offer a much more affordable pay-as-you-go pricing
         | model at https://browserbase.com/
         | 
         | Feel free to shoot me an email if you're interested in trying
         | it out! paul@browserbase.com
        
       | generalizations wrote:
       | I'm surprised that puppeteer uses that default window size. Last
       | time I did a scraping project, I assumed randomizing the window
       | size would be table stakes.
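
As a rough illustration of randomizing the window size (Puppeteer's default viewport is 800x600), the sketch below picks a common screen resolution and subtracts a random allowance for browser chrome. The resolution list, the offsets, and the `random_viewport` name are assumptions for this example; the returned dict merely mirrors the width/height shape a viewport setting usually takes.

```python
import random

# Illustrative list of common desktop resolutions (an assumption, not measured data).
COMMON_SCREENS = [(1366, 768), (1440, 900), (1536, 864), (1920, 1080)]


def random_viewport(rng):
    """Pick a plausible window size instead of a fixed default."""
    screen_w, screen_h = rng.choice(COMMON_SCREENS)
    # A real window is usually a little smaller than the screen:
    # subtract a random allowance for toolbars, taskbar and borders.
    width = screen_w - rng.randint(0, 120)
    height = screen_h - rng.randint(80, 200)
    return {"width": width, "height": height}


if __name__ == "__main__":
    rng = random.Random()
    print(random_viewport(rng))
```

On its own this removes only one trivially detectable default; as other comments in the thread note, serious bot detection looks at many more signals.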
        
         | darepublic wrote:
         | Puppeteer wasn't created for scraping per se, right? I mean it
         | could be, but isn't puppeteer created/maintained by google? So
         | it's not meant to fly under the radar and avoid detection, it's
         | just a naive tool for automating websites that would never try
         | to block it.
        
       | darepublic wrote:
       | Personally I see a legit use of this as avoiding detection with
       | small amounts of traffic. I just want a browser bot to do things
       | in my name, and not DDOS, but just spare me from having to
       | personally go and tap my phone screen.
        
       ___________________________________________________________________
       (page generated 2024-02-27 23:00 UTC)