[HN Gopher] The /unblock API from Browserless: dodging bot detec...
___________________________________________________________________
The /unblock API from Browserless: dodging bot detection as a
service
Author : keepamovin
Score : 99 points
Date : 2024-02-27 17:30 UTC (5 hours ago)
(HTM) web link (www.browserless.io)
(TXT) w3m dump (www.browserless.io)
| adsthrowaway99 wrote:
| For a while I worked in ads, and the specific team I was in put
| us in a position of being on both sides of this. We were tasked
| with monitoring advertisements for fraud, which meant we both had
| to catch bot traffic, but also we had to check ads for
| scams/phishing/etc. So on the one hand we were trying to plug
| holes in our own bot detection for botnets trying to drive up
| impression numbers to get website operators paid more, but on the
| other hand we were trying to bypass bot detecting on the part of
| phishing page authors (who would do things like display a totally
| plausible e-commerce page if they think the ad viewer is a bot,
| but a bank of america phishing page if they think it's a real
| user).
| Avamander wrote:
| > other hand we were trying to bypass bot detecting on the part
| of phishing page authors
|
| An approach I recently saw gated the phishing page behind a
| Google Account login page. A user that was logged in wouldn't
| even notice the brief redirect. Scanners would just get stuck.
| I hope they've patched it by now though...
| yjftsjthsd-h wrote:
| A coincidence of how I work with my browser profiles is that
| the browser (profile) that does most of my web browsing isn't
| logged into Google. It's nice to see that this has extra
| benefits:) Now if you'll excuse me, I need to go move my
| entire computing environment into a VM so that viruses will
| think they're being analyzed and refuse to run... (Kidding
| but also not)
| spondylosaurus wrote:
| Ha, I think we may have worked at the same company (or at least
| at competing companies)....
| nhggfu wrote:
| someone got tweak we can use for puppeteer to achieve this
| effect? [kinda LOL @ asking people who SCRAPE DATA for free to
| pay for an API, no?]
| ohthatsnotright wrote:
| You can use Firefox nightly and create a webextension using
| their internal APIs. Gives you access to injecting non-
| synthetic or faked events (they look indistinguishable since
| they are emitted in the same code path that the real input
| device uses) and scraping the DOM with the added advantage of
| not exposing the automation flags that Firefox has when you use
| Puppeteer and the like in headless, automated mode.
|
| edit: Firefox Nightly because it gives you much deeper access
| to Firefox internals.
| nhggfu wrote:
| interesting. would love to see some examples / code
| ohthatsnotright wrote:
| I can't give examples (I did the work for a company so I
| don't own it) but starting at [0] [1] should get you to the
| internals. It definitely carries some risk of Firefox
| Nightly just randomly breaking when it updates but this
| solution got us beyond quite a number of bot detectors so
| if you're willing to risk Firefox Nightly it's possible. We
| didn't need nearly the flexibility something like Puppeteer
| has since we were injecting rrweb into the target page to
| record the DOM mutations and sending those to other
| clients.
|
| That will give you a good start on where to even begin
| looking. web-ext is insanely handy for launching Firefox
| with a WebExtension already installed (but I did a lot of
| other work around this to productize the offering).
|
| [0] https://firefox-source-
| docs.mozilla.org/toolkit/components/e... [1]
| https://firefox-source-docs.mozilla.org/overview/gecko.html
| generalizations wrote:
| What are some sites that have good bot detection? I'd be
| interested in just seeing how my scraper holds up against
| them.
| michaelt wrote:
| There's actually an (admittedly small) data scraping industry.
|
| Let's say you are Wal-Mart, and you'd like to know which of the
| products you sell are available cheaper at Target. Or which
| neighbourhoods they deliver to that you don't, or which stores
| have longer opening hours than yours, or whatever.
|
| You can't legally exchange data with Target directly, that
| would risk making you an illegal price-fixing cartel. You _can_
| legally visit their website, but you don't feel like matching
| up 10,000+ products.
|
| Instead, you call up a business intelligence firm who has
| already scraped your site and theirs and matched the products
| up. They'll send you a CSV, for a price.
| mmcclure wrote:
| I can't help but feel a little conflicted about things like this
| in the current climate. On one hand, as a developer trying to
| just get some updated stuff into a spreadsheet, this looks
| extremely useful.
|
| On the other hand, there are a lot of people right now that want
| to keep stuff accessible to humans and not have it scraped for
| models. I know the lid's completely off the box on that front so
| it's probably useless to fight it, but seeing _products_
| explicitly designed to circumvent bot prevention feels kinda bad.
| imathrowaway wrote:
| Agreed, we've seen a lot of governing bodies _now_ start
| scraping to fight scam sites and "fake" sites, so there's a lot
| of legitimate use-cases popping up. It's not necessarily just
| about gathering e-commerce data or backfilling lack of an API
| anymore.
|
| Source: Founder@browserless.io
| Matheus28 wrote:
| So how will you ensure your service is only used for "good"?
| 67j67j67n wrote:
| no services remain "good" so why ask the question. it gets
| ignored or answered but then a decade or two down the line
| the philosophy gets scrapped, a story as old as time.
| chankstein38 wrote:
| I still don't understand why Google publicly removed "Do
| No Evil" from their company. It's one thing to not follow
| it but to actively remove it feels weird. Like they
| wanted to say "HEY Y'ALL GET READY FOR EVIL!"
| ziddoap wrote:
| You can go to https://abc.xyz/investor/google-code-of-
| conduct/ and ctrl+f "evil", and you'll see it's still
| there.
| qingcharles wrote:
| Can confirm this Google document contains the word
| "evil."
| scarface_74 wrote:
| I fail to see any legitimate use case for getting around
| protections that web developers put in to help prevent
| this.
| mmcclure wrote:
| I know this is slightly off topic, but I have to ask: Why not
| create a new account rather than use the same self-described
| throwaway account you used to ask about raising prices (for
| browserless?)[1]? For what it's worth, I initially assumed
| you likely weren't legitimate and didn't engage, so if you
| want to keep contributing to threads as yourself you might
| want to consider using a new/different username.
|
| [1] https://news.ycombinator.com/item?id=21271224
| pkiv wrote:
| In the end, the best way to avoid being blocked is to be a good
| actor. All of these hacks won't stop someone who's determined to
| prevent access (ie: LinkedIn).
|
| That's actually one of the reasons why I started
| https://browserbase.com/. Maintaining headless browser
| infrastructure can be such a pain. I've spent a lot of time
| managing headless chrome fleets at scale, so happy to answer any
| questions.
| Etheryte wrote:
| If I understand correctly, a lot of the issues you can run into
| with regards to blocking come from the fact that you're using a
| headless browser. Past a certain point, wouldn't it be less
| work to use a regular browser and drive with Selenium or
| similar solutions? Or does that not address the kind of
| problems you're facing?
| djbusby wrote:
| I created a dedicated chrome profile (--user-data-dir) signed
| in to a few sites and then drive it, with visible window from
| scripts.
|
| It does all my crawling; it goes very slowly, and it has never
| triggered the bot detectors.
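
A minimal sketch of the setup described above, assuming Selenium's Python bindings and a local Chrome install. The profile path, the delay values, and the helper names (`profile_args`, `polite_delay`, `crawl`) are illustrative assumptions, not the commenter's actual script. Only `--user-data-dir` comes from the comment itself.

```python
import os
import random
import time


def profile_args(profile_dir):
    """Chrome flags for a dedicated, persistent profile (keeps logins)."""
    return [f"--user-data-dir={os.path.expanduser(profile_dir)}"]


def polite_delay(base=5.0, jitter=3.0):
    """Seconds to wait between page loads; the values are arbitrary."""
    return base + random.uniform(0.0, jitter)


def crawl(urls, profile_dir):
    """Visit each URL in a visible Chrome window and return page titles."""
    from selenium import webdriver  # third-party; imported only when crawling

    options = webdriver.ChromeOptions()
    for arg in profile_args(profile_dir):
        options.add_argument(arg)
    driver = webdriver.Chrome(options=options)  # headed unless --headless is set
    titles = {}
    try:
        for url in urls:
            driver.get(url)
            titles[url] = driver.title
            time.sleep(polite_delay())  # deliberately slow, one page at a time
    finally:
        driver.quit()
    return titles
```

Calling `crawl(["https://example.org"], "~/crawl-profile")` would reuse whatever logins were saved in that profile directory on an earlier manual run.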
| pkiv wrote:
| The newest version of headless chrome actually runs the same
| code as a "regular browser":
| https://developer.chrome.com/docs/chromium/new-headless
| tzs wrote:
| I used to semi-automate access to some sites by using
| Selenium with a non-headless browser. These were sites where
| there were just one or two pages where I wanted some
| automation to fill out a form or scrape some data, and they
| frequently made changes to the home page that made it hard to
| automate navigating from the home page to the pages I wanted
| to automate.
|
| The idea was to have a script use Selenium to launch
| non-headless Chrome and then wait:
|
|     from selenium.webdriver import Chrome
|
|     driver = Chrome()
|     driver.get("https://example.org")
|     input("Press enter when ready")
|
| I could then manually deal with logging in, answering any
| CAPTCHA that came up, and navigate to the page I wanted to
| run my automation. Then I could press "enter" in my terminal
| and my script would continue.
|
| That used to work fine, but then on sites using Cloudflare's
| CAPTCHA it stopped working. Solving the CAPTCHA would just
| result in another CAPTCHA.
|
| I tried an alternative Selenium Chrome driver that was
| supposed to be more stealthy, and tried setting various flags
| that were supposed to make it so JavaScript could not tell
| that Selenium was there, and those worked for a while, but
| then they stopped working.
|
| The results were similar using Selenium with Firefox.
|
| I also tried Puppeteer, with Chromium and Firefox, and they
| too could not get past the CAPTCHA loops.
|
| I then tried Playwright. With Chromium and Webkit that got
| the CAPTCHA loops. With Firefox it actually worked. I didn't
| even see the CAPTCHA. The non-interactive check for not being
| a bot passed.
|
| Still, the whole approach seems fragile. I don't know if
| Firefox/Playwright working was due to some fundamental
| difference between Firefox and the others or just Cloudflare
| having not yet gotten around to dealing with it.
| dns_snek wrote:
| Are there any stories you're willing to share, any tough nuts
| you've had to crack to improve some aspect of operations,
| whether it be reliability, performance, bot detection evasion,
| or something else completely?
|
| I've only dealt with scraping on a small scale and I quickly
| realized that running "browsers as a service" is a pain in the
| ass, they're not exactly lightweight, they like to get "stuck",
| balloon in memory or some such.
|
| I imagine your business will be quite successful if reliability
| is good and the price is right!
| pkiv wrote:
| I gave a lightning talk on headless chrome here that is worth
| checking out!
|
| https://www.youtube.com/watch?v=vs-qzlW9M50&t=726s
| Avamander wrote:
| This sounds a lot like Abuse as a Service.
|
| Trying to defend against malicious bots is tedious at best,
| impossible at the worst. I don't really see how lowering the bar
| would be a net positive. People will start using methods that
| will cause more collateral damage and just reduce user freedoms.
|
| My guess is that this only increases the push towards attestation
| and attestation-like approaches. Login walls and PAT/Privacy Pass
| are just the start.
| Spivak wrote:
| Damned if you do, damned if you don't. The only way to mitigate
| the attestation future will be legislation so you might as well
| do it while it lasts.
|
| If you can't bypass bot detection now because in response
| they'll make the bot detection harder then there's really no
| winning is there?
| Avamander wrote:
| Even if you or I might find attestation against user freedoms
| nobody is going to legislate attestation away. The
| alternatives to the vast majority are way worse.
| realusername wrote:
| Device attestation doesn't solve the bot problem either as seen
| on Android.
|
| It just annoys browsers and OS with a smaller marketshare to
| the point that I'm wondering if it's even legal with antitrust
| legislation.
| jeroenhd wrote:
| It will, but only on recent, locked-down Android
| devices. The first step in device attestation will probably
| allow the Android click farms to continue, but once it's in
| place, restrictions will just become stricter over time.
| realusername wrote:
| The clickfarms nowadays already use tons of non modified
| devices built on racks so I really don't see how it's going
| to make a dent.
|
| Devices are way too cheap for an attestation system to work
| to counter bots
| Avamander wrote:
| Not only click farms, these devices are also used for SMS
| toll fraud and other similar attacks. SafetyNet's
| attestation only helps to the extent that you can ban a
| specific model/maker.
| realusername wrote:
| Yes that's basically it, you can ban a specific device
| but that's like removing water from the ocean with a
| bucket.
|
| Betting on device attestation really means betting that
| computing devices will become more and more expensive in
| the future since the device cost is the only blocker
| created by the attestation.
|
| And if there's one thing I'm expecting, it's that this is not
| going to happen; device prices will continue to decline.
| Etheryte wrote:
| It could, but not in a way any one of us would want it. This
| specific attested device isn't marked as legit by any of the
| big players? Into the blocked list you go. This is very
| similar to what we see today with email providers for
| example. If you start running a new server, you start with a
| negative reputation and have to climb out, you don't start
| from neutral.
| judge2020 wrote:
| > If you start running a new server, you start with a
| negative reputation and have to climb out, you don't start
| from neutral.
|
| Not really. If you start a new server from an IP range with
| known bad actors, sure. Many have tried and failed to run
| their own mail servers from Digitalocean, vultr, or even
| more dedicated-esque server hosts and pretty quickly see
| how much of a hassle doing so is.
|
| But if you buy a v4 /24 from some reputable or old company
| and get it assigned to your own AS, you won't have negative
| reputation and will be fairly successful as long as you
| have spf/dkim/dmarc set up properly.
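
For readers unfamiliar with what "set up properly" refers to: SPF and DMARC are published as DNS TXT records (SPF on the sending domain, DMARC under `_dmarc.<domain>`), and DKIM keys live under `<selector>._domainkey.<domain>`. The sketch below only parses example record strings; the records, the domain, and the helper names are hypothetical, and it does not perform DNS lookups or validate policy semantics.

```python
def spf_terms(record):
    """Return the terms of an SPF record, without the version prefix."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")
    return terms[1:]


def parse_tags(record):
    """Split a 'tag=value; tag=value' record (DMARC style) into a dict."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        tags[key.strip().lower()] = value.strip()
    return tags


# Hypothetical records for example.org (192.0.2.0/24 is a documentation range).
SPF_RECORD = "v=spf1 ip4:192.0.2.0/24 -all"
DMARC_RECORD = "v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.org"

if __name__ == "__main__":
    print(spf_terms(SPF_RECORD))
    print(parse_tags(DMARC_RECORD))
```

Here `-all` tells receivers to reject mail from any address outside the listed range, and `p=reject` asks them to reject messages that fail DMARC alignment.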
| dns_snek wrote:
| > But if you buy a v4 /24 from some reputable or old
| company and get it assigned to your own AS
|
| Doesn't that have an extremely high barrier to entry? A
| legal entity, significant ongoing fees, and some
| infrastructure?
|
| I'm under the impression that this is entirely out of
| reach for anyone just trying to run an email server at
| home for personal use. I would love to be wrong!
| Avamander wrote:
| Bulk sending (including transactional) indeed is out of
| reach in that sense, but it's still not too difficult to
| pull off on a smaller scale.
| throwaway81523 wrote:
| > if you buy a v4 /24 from some reputable or old company
| and get it assigned to your own AS, you won't have
| negative reputation
|
| until you give yourself negative reputation by running
| abuse bots from that range.
| jeroenhd wrote:
| Services like these are why massive blocks of IP addresses end
| up firewalled off, and why Cloudflare/Google CAPTCHAs are
| absolutely everywhere.
|
| Normally, you can just block non-consumer ISPs, but this site
| offers "residential proxy" services (basically a botnet-as-a-
| service), which means that now consumer IP ranges need to be
| selectively blocked as well.
|
| I think PAT/Privacy Pass will solve this problem as far as
| normal users are concerned ("normal" meaning "running Windows,
| macOS, iOS, or Android, on devices with hardware attestation
| capabilities"), but soon enough we'll arrive in an age where
| you can only visit so many web pages before you've exceeded
| your daily internet allowance.
| CatWChainsaw wrote:
| Maybe getting rid of Abuse As A Service is a positive use case
| for Autonomous Killer War Drones as a Service!
| vetrom wrote:
| The expression is a bit different but this is literally a
| (minor) piece of worldbuilding in Daniel Suarez's 'Daemon'
| (2006).
| buildfocus wrote:
| Attestation creates all sorts of challenges absolutely, but it
| actually doesn't really help here - there's plenty of services
| doing this on real devices, and no real challenges to doing so
| at the prices people will pay. The overhead is the one-off cost
| of the cheapest legitimate device that will attest, split
| between the many users to get near 100% utilisation over a
| large period. Once you look at cheap androids & PCs, this gets
| very cheap indeed.
|
| The only solution is actual paid services, or real-user
| verification (and deduplication, which means little privacy,
| which means legal problems in much of the world) for free
| accounts.
|
| If you publish something on the internet for free, you _must_
| accept that people can read it automatically. You can make it
| difficult, make it a bit more expensive, but at the end of the
| day paid data means paywalls.
| unnouinceput wrote:
| The never-ending game of walls and ladders
| jastingo wrote:
| What are the legitimate (i.e. legal) use cases for a product such
| as this?
|
| I agree with another comment that called this "Abuse as a
| Service". It seems to me this product's design goal is nothing
| more than to circumvent measures site owners take to prevent
| abuse of their site and run a sustainable business.
| BoorishBears wrote:
| I used their previously available bot detection defeat to add
| an import feature to my website: Users could link to their
| creation on another site and my site would scrape the publicly
| available content so they wouldn't need to re-enter all their
| data
|
| I've used their product many times actually, and I'm shocked on
| Hacker News of all places no one's thinking of anything besides
| abuse. How often is it useful to get information from a webpage
| and apply it in a new context? Then think of how often said
| webpage is behind a Cloudflare bot detector.
| daemonw wrote:
| If it's the user's data, then under GDPR the other site is
| obligated to provide a way for them to download/transfer it,
| specifically with this use case in mind.
|
| They are completely in the right to block you though, you're
| not the owner of that data, you might be breaking their TOS.
| yjftsjthsd-h wrote:
| > If it's the user's data, then under GDPR the other site
| is obligated to provide a way for them to download/transfer
| it, specifically with this use case in mind.
|
| In Europe, if the company is actually following the law, in
| theory yes.
|
| > They are completely in the right to block you though,
| you're not the owner of that data, you might be breaking
| their TOS.
|
| IANAL, but AIUI that's definitely not true in the United
| States and I suspect similar ideas hold elsewhere:
| https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
| Klonoar wrote:
| There are a litany of posts on this very site that detail
| why HiQ vs LinkedIn is more nuanced than you're making it
| out to be. HiQ didn't ultimately have the slam dunk win
| that people think they did.
| BoorishBears wrote:
| This is a non sequitur to my comment:
|
| - GDPR doesn't require it be a convenient export. Users
| want to paste a link on my site, click a button, and have
| it magically appear. Not fill out a form, dump their entire
| account and sift through that.
|
| - I never opined on the validity of blocking bots
|
| - I never opined on if it's breaking their TOS
|
| Abuse implies a harmfulness. Giving users a quick import
| option from already public data isn't harmful.
| causalmodels wrote:
| Bots acting on the behalf of users should not be blocked but we
| have spent several decades treating bots (except for the
| googlebot) as bad.
|
| Like if I want to programmatically unsubscribe from a
| subscription, why should I have to do it myself?
| theamk wrote:
| That's a bad example, "programmatically unsubscribing" means
| giving spammers information that this address is alive. A
| much better solution is to report the unwanted email as SPAM,
| so the sender's reputation takes a hit.
|
| (and for that 1% of the cases where the address is not a
| spammer and user knows it, they can just hit "unsubscribe"
| manually)
| PathfinderBot wrote:
| What about something like Nitter? Archiving? Adversarial
| bridging between different platforms? Automation?
|
| How will well-behaved scrapers undermine the sustainability of
| a business? I guess adblocking is one, but we can already do
| that with uBlock and that's legal. Or adversarial bridging, but
| that only serves to boost competition.
|
| In other words, the question is flipped; why would well-behaved
| (i.e. non-DDoSing) scrapers be illegal?
| judge2020 wrote:
| Legality isn't the question here. If you want to speak to the
| legality, anyone circumventing a robots txt that explicitly
| has your bot's user-agent and 'disallow: *' is unauthorized
| access (I imagine it's more nuanced for 'user-agent: *'). No
| website is required to allow anyone to visit and can
| discriminate against any client or software any way they
| want.
| yjftsjthsd-h wrote:
| > Legality isn't the question here.
|
| The question was literally,
|
| > What are the legitimate (i.e. legal) use cases for a
| product such as this?
| Szpadel wrote:
| for stuff I use similar self-hosted solution: detecting when
| kid lessons are available on local portal. but to be fair
| cheapest option here ($200) isn't usable for non-business usage
|
| Ps: context why I need automation for such thing: those lessons
| are really popular and are announced at unpredictable time /
| there might be another spot when someone resigns
| tzs wrote:
| I've got a couple of things I've used browser automation tools
| for:
|
| * I want to automate (or at least semi-automate) downloading
| bank statements. I've got ~14 accounts (checking, savings,
| credit card, IRA, investment, HSA) across 7 financial
| institutions.
|
| It's tedious to go download statements from all of them
| manually.
|
| * I want to save stories from FanFiction.net (FFN) for offline
| reading. FFN's terms allow automation as long as it doesn't
| operate faster than a human [1].
|
| [1] From their TOS:
|
| > You agree not to use or launch any automated system,
| including without limitation, "robots," "spiders," or "offline
| readers," that accesses the Service in a manner that sends more
| request messages to the FanFiction.Net servers in a given
| period of time than a human can reasonably produce in the same
| period by using a conventional on-line web browser.
| iamacyborg wrote:
| > I want to automate (or at least semi-automate) downloading
| bank statements. I've got ~14 accounts (checking, savings,
| credit card, IRA, investment, HSA) across 7 financial
| institutions.
|
| Could you not shoot an email to those institutions asking for
| a copy of the documents?
| heipei wrote:
| Scanning for malicious and phishing websites. These types of
| sites are just enjoying the ease of free services like
| Cloudflare to block automated analysis tools and tailor their
| phishing campaigns to very specific geographical locations and
| user groups.
| codedokode wrote:
| I think they should introduce request rate limits per
| IP/domain, for example max 1 parallel request. In this case
| there will be no significant load, but the data can be scraped.
|
| Scraping is important for example, to monitor competitors'
| prices to see the opportunity to raise your own prices.
|
| And let's not forget that Google does a lot more scraping than
| anyone else and has ridiculous profits from it.
| acaloiar wrote:
| I'm not a customer, but I have a use case that in my opinion
| should be legal.
|
| For years I've used my own terminal UI player (di-tui) for
| di.fm. At some point in the not-so-recent past, di.fm added
| Cloudflare's WAF, which prevents me from using one of my app's
| features: managing channel favorites within the app.
|
| To be clear, I'm a paying di.fm customer, and my app only works
| for paying customers. But now my preferred method of listening
| to di.fm is slightly hamstrung because Cloudflare's WAF sits
| between me and a little string token available to every browser
| that accesses di.fm (even non-paying customers).
| jonatron wrote:
| You sell jastingo(tm) brand widgets, but you notice fakes are
| being sold on eBay, Amazon, AliExpress. You set up a scraper to
| search for jastingo widgets every day on every marketplace
| site, but you get blocked. So now you need an unblocker to
| enforce your copyright/trademark/patent.
| iamacyborg wrote:
| Why does it need to be that complicated? If marketplaces are
| selling fakes, get your lawyer to send them a letter.
| jonatron wrote:
| What if you have 10 brands, with 10 products each, and
| there's 10 marketplaces.
| qingcharles wrote:
| I'm genuinely scraping a certain social network that doesn't
| have an accessible API to do what I need. My user is logged in
| and I just automate the logged-in browser to go to the pages
| and get the data I need into a console so I can get the data I
| require.
|
| If there was an accessible API to do what I need, I wouldn't do
| this because scraping sucks. I have to write 100 JavaScript
| edge cases to handle all the times the host's servers fail in
| very weird ways. Plus, walking DOMs on these shitty sites with
| 10,000 nested divs is not fun. GPT helps with this.
|
| It's net-positive for the host though, as I upload a lot of
| valuable content that their users genuinely like, but it sucks
| that I have to be sneaky to get the data I need.
| LewisJEllis wrote:
| I used to work for a bot mitigation vendor 8-10 years ago,
| researching / implementing signals for this cat and mouse game.
|
| This will get you past some very mundane bot detections, but
| really this is like, the very first baby step of a long rabbit
| hole.
|
| The people who are taking this game seriously are 5-10 years
| ahead of this step. Good luck ¯\_(ツ)_/¯
| danpalmer wrote:
| Yeah that's my reading. No way this is passing Akamai bot
| detection.
|
| There are lots of signals like timings, user tapping and
| scrolling behaviour, signed sessions cookies that represent
| browsing flows which may be legitimate or not. And that's all
| assuming you're on a good looking IP. To do this you need a
| large supply of residential IPs which then leads to the dodgy
| underworld of botnets.
|
| I'd be surprised if this works for anything but the most basic
| bot protection, this is an advanced space.
|
| If it does work for those cases, they should be either keeping
| it quiet and making bank, or boasting about having a secret
| sauce, not basic stuff like this.
|
| Edit: for apps, Akamai provides an SDK that uses things like
| your motion data to create a signature that suggests that
| you're a real user. This signature is either injected into API
| requests or into a webview session. I'm sure it's crackable if
| you dedicate significant reverse engineering resources to it,
| but then you've got to crack every version, crack every other
| implementation from other companies, etc. Non-starter.
| PathfinderBot wrote:
| I'm just an outsider, but I wonder if these sort of bot-
| blocker-bypass services can be done by employing people to go
| to those pages manually.
| dns_snek wrote:
| When it comes to "motion data", are you referring to the
| client's movement around the website, i.e. which URLs they
| access and in which order? Their taps, clicks, scroll events,
| and other inputs?
| spondylosaurus wrote:
| Motion data (especially in the context of an SDK, which I
| assume means they're talking about in-app environments
| rather than browser environments) usually refers to
| gyroscope readouts on a phone or tablet.
|
| A device that stays rock-steady throughout an entire
| browsing session isn't necessarily suspicious on its own--
| for example, you could have your phone laid flat on the
| table while you browse with your pointer finger--but it
| can be a useful tell in combination with other suspicious
| factors.
| dns_snek wrote:
| That was my first thought as well but I couldn't believe
| it. Doesn't access to gyroscope data require some sort of
| permission prompt in the browser?
| chankstein38 wrote:
| Asking out of almost total ignorance of this field, what
| prevents someone from running a script that sets their
| agent to a phone browser and then sending fake gyro data?
| Surely there's a way to emulate enough to make it look
| like a phone that's being held by someone, right? We can
| do realistic camera shake in blender to the point where
| something looks like it's being held by a person, why
| couldn't we fake minute movements like the device is
| held?
|
| Why do we even need an actual device? We can emulate if
| we even need to and set our headers to look like we're
| coming from a device browser.
| Klonoar wrote:
| It's likely referring to gyroscope, intended to guard
| against racks of devices sitting somewhere that don't
| physically move.
|
| Humans move their devices even when typing.
| imathrowaway wrote:
| It's a never ending battle. Lots of tools aren't as
| sophisticated as they claim to be, and the current mechanisms
| inject a lot of "other stuff" that can be easily found. We're
| trying to do this in a more novel way that's faster, less prone
| to needing frequent updates, and is more akin to how actual
| users interact with browsers. Definitely work to be done, but
| it's exciting to see, and I appreciate the good luck!
|
| Source: Founder@browserless.io
| NorwegianDude wrote:
| I've been fighting tailor made bots for over 15 years, and I
| can't say I'm a fan of people trying to circumvent it. In some
| cases it might not matter, but in others it can actually ruin a
| lot of things.
|
| Let's say I have a few 100 Gbps connections that are mostly idle,
| is it fine if I direct them at browserless? No? Exactly, that
| kind of traffic is not wanted.
| codedokode wrote:
| Browserless could implement limits on their side, for example,
| allow only 1 parallel request to IP/domain. This way there will
| be no significant load, but the data will be scraped.
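
One way the "1 parallel request per domain" idea could look on the client side, as a rough sketch only. It keys the limit on hostname rather than IP, uses a stand-in `fake_fetch` instead of a real HTTP call, and says nothing about how Browserless actually works. All names here are hypothetical.

```python
import asyncio
from urllib.parse import urlsplit


class PerHostLimiter:
    """Allow at most `max_parallel` in-flight requests per hostname."""

    def __init__(self, max_parallel=1):
        self._max_parallel = max_parallel
        self._semaphores = {}

    def _semaphore_for(self, url):
        host = urlsplit(url).hostname or ""
        if host not in self._semaphores:
            self._semaphores[host] = asyncio.Semaphore(self._max_parallel)
        return self._semaphores[host]

    async def run(self, url, fetch):
        # Requests to the same host queue up here; other hosts proceed.
        async with self._semaphore_for(url):
            return await fetch(url)


async def demo():
    """Fire four requests at two hosts and record peak concurrency per host."""
    limiter = PerHostLimiter(max_parallel=1)
    in_flight, peak = {}, {}

    async def fake_fetch(url):  # stand-in for a real HTTP request
        host = urlsplit(url).hostname
        in_flight[host] = in_flight.get(host, 0) + 1
        peak[host] = max(peak.get(host, 0), in_flight[host])
        await asyncio.sleep(0.01)
        in_flight[host] -= 1
        return url

    urls = [
        "https://a.example/1", "https://a.example/2",
        "https://b.example/1", "https://a.example/3",
    ]
    results = await asyncio.gather(*(limiter.run(u, fake_fetch) for u in urls))
    return results, peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A limit like this caps load on each site, though it does nothing about the consent question raised in the parent comment.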
| thrwwycbr wrote:
| As if IPs matter in the golden age of IoT botnets and
| residential app malware.
|
| You can't even block the amount of subnets that's coming for
| you in a DDoS attack, thinking a human is able to keep up
| with something like this is pretty blindsighted and naive.
| The differentiation of network protocols and relay attacks
| alone is way too slow to be mitigated in most systems.
| iamacyborg wrote:
| It's all fun and games until you get a taste of your own
| medicine.
| reactordev wrote:
| I'm sorry if this offends you, but it's companies like you that I
| despise. WAFs are a good thing. Your bot is not authorized. We'll
| be blocking all IP CIDRs from you, your referrers, and your
| service. Thanks.
| yard2010 wrote:
| No. Long live the piracy. Good bye and thanks for the fish.
| internetter wrote:
| 90% of the problems solved with puppeteer could be instead solved
| by replaying a few API requests. Cheaper, less wasteful, faster,
| and in some cases harder to detect (WAF is rarely deployed on APIs
| because there are far fewer heuristics).
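
A sketch of what "replaying an API request" can mean in practice: copy a JSON call seen in the browser's network tab and rebuild it with the standard library. The endpoint, payload, and function names below are hypothetical placeholders, and real sites may require extra headers or tokens that are not shown.

```python
import json
import urllib.request


def build_replay_request(url, payload, cookie=None):
    """Rebuild a JSON POST as the browser would have sent it."""
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    if cookie:
        headers["Cookie"] = cookie  # session cookie copied from a logged-in browser
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical endpoint; nothing is sent until send() is called.
    req = build_replay_request("https://api.example.org/v1/search", {"q": "widgets"})
    print(req.get_method(), req.full_url)
```

Compared with driving a full browser, this sends one small request per result page and needs no rendering at all.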
| iambateman wrote:
| A lot of negativity about scraped data, in comments...but both
| OpenAI and Google rely on scraping as their foundation
| for...everything.
|
| If smaller indie hackers are going to build useful competitors,
| they're going to need to scrape.
|
| I understand that a lot of scraping is from spammers, but not
| all.
| bdcravens wrote:
| > If smaller indie hackers are going to build useful
| competitors
|
| Any useful competitor will have to scale well beyond the point
| of indie hacker, where someone is (eventually) getting rich.
| That doesn't make for a great argument for bypassing consent.
| troyvit wrote:
| Depends on how you describe "useful." Kagi is a profitable
| search company that satisfies its customers while competing
| directly with Google. They aren't useful in the sense that
| they're taking marketshare from Google, but they are useful
| to their user base. I hope they continue being able to scrape
| sites with the same freedom as other search engines.
|
| Edit: As mmcclure points out Kagi actually doesn't index
| much. So that's a bad example that might invalidate my point.
| mmcclure wrote:
| I'm a happy Kagi user, but they might be somewhat a bad
| example. I suspect their actual indexing work is pretty
| limited, and they only seem to claim that it's for indexing
| "niche" stuff.
|
| > Our data includes anonymized API calls to traditional
| search indexes like Google, Yandex, Mojeek and Brave,
| specialized search engines like Marginalia, and sources of
| vertical information like Apple, Wikipedia, Open Meteo,
| Yelp, TripAdvisor and other APIs. Typically every search
| query on Kagi will call a number of different sources at
| the same time, all with the purpose of bringing the best
| possible search results to the user.
|
| source: https://help.kagi.com/kagi/search-details/search-
| sources.htm...
| troyvit wrote:
| Thanks I updated my comment. Also, damn that's cool.
| mmcclure wrote:
| Scraping itself isn't a problem. Scraping content where people
| are actively trying to stop it from being scraped is where
| things get questionable.
|
| > both OpenAI and Google rely on scraping as their foundation
| for...everything.
|
| OpenAI and Google also (claim to) respect the most basic bot
| deterrent, robots.txt, with their crawlers. If someone doesn't
| want their content scraped to be included by search indexes or
| LLMs/models, then that certainly includes "smaller indie
| hackers."
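
For reference, checking robots.txt before fetching takes only a few lines with Python's standard `urllib.robotparser`. The robots.txt content, the bot names, and the `allowed` helper below are made up for illustration. Note that the conventional way to disallow an entire site is `Disallow: /`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ExampleBot is banned outright, and everyone
# else is kept out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        print(agent, allowed(ROBOTS_TXT, agent, "https://example.org/page"))
```

A crawler that wants to honor the file would call something like `allowed(...)` before each request and skip any URL for which it returns False.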
| silentsanctuary wrote:
| I want to try Browserless but it doesn't appear to be possible to
| do so without signing up for a free trial of a $200/month plan -
| have they especially hidden this or am I just blind?
| pkiv wrote:
| They recently increased their pricing quite a bit. We're
| looking to offer a much more affordable pay-as-you-go pricing
| model at https://browserbase.com/
|
| Feel free to shoot me an email if you're interested in trying
| it out! paul@browserbase.com
| generalizations wrote:
| I'm surprised that puppeteer uses that default window size. Last
| time I did a scraping project, I assumed randomizing the window
| size would be table stakes.
| darepublic wrote:
| Puppeteer is not created for scraping per se right. I mean it
| could be, but isn't puppeteer created/maintained by google? So
| it's not meant to fly under the radar and avoid detection, it's
| just a naive tool for automating websites that would never try
| to block it.
| darepublic wrote:
| Personally I see a legit use of this as avoiding detection with
| small amounts of traffic. I just want a browser bot to do things
| in my name, and not DDOS, but just spare me from having to
| personally go and tap my phone screen.
___________________________________________________________________
(page generated 2024-02-27 23:00 UTC)