[HN Gopher] Avoiding bot detection: How to scrape the web withou...
___________________________________________________________________
Avoiding bot detection: How to scrape the web without getting
blocked?
Author : proszkinasenne2
Score : 90 points
Date : 2021-10-31 20:48 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| curun1r wrote:
| There's one technique that can be very useful in some
| circumstances that isn't mentioned. Put simply, some sites try to
| block all bots except for those from the major search engines.
| They don't want their content scraped, but they want the traffic
| that comes from search. In those cases, it's often possible to
| scrape the search engines instead using specialized queries
| designed to get the content you want into the blurb for each
| search result.
|
| This kind of indirect scraping can be useful for getting almost
| all the information you want from sites like LinkedIn that do
| aggressive scraping detection.
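 |
 | A rough sketch of that indirect approach in Python, assuming a
 | search API such as Bing's (the endpoint, key, and site: query
 | below are illustrative, not something the parent comment
 | prescribes):
 |
 |   import requests
 |
 |   API_KEY = "YOUR_BING_KEY"   # hypothetical credential
 |   ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"
 |
 |   def snippets_for(query):
 |       """Read result blurbs instead of fetching the site itself."""
 |       resp = requests.get(
 |           ENDPOINT,
 |           headers={"Ocp-Apim-Subscription-Key": API_KEY},
 |           params={"q": query, "count": 20},
 |           timeout=10,
 |       )
 |       resp.raise_for_status()
 |       pages = resp.json().get("webPages", {}).get("value", [])
 |       return [(p["name"], p["snippet"]) for p in pages]
 |
 |   # e.g. pull profile blurbs without hitting the target site
 |   query = 'site:linkedin.com/in "data engineer" Berlin'
 |   for title, blurb in snippets_for(query):
 |       print(title, "->", blurb)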
| amelius wrote:
| But won't the search engines block you after some limit has
| been reached?
| curun1r wrote:
| Eventually, but they're not very aggressive when it comes to
| bot detection. Simple IP rotation usually works.
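 |
 | A bare-bones version of that rotation with Python's requests
 | (the proxy addresses are placeholders for whatever pool you
 | actually control):
 |
 |   import itertools
 |   import requests
 |
 |   # placeholder proxy pool
 |   PROXIES = itertools.cycle([
 |       "http://10.0.0.1:3128",
 |       "http://10.0.0.2:3128",
 |       "http://10.0.0.3:3128",
 |   ])
 |
 |   def fetch(url):
 |       proxy = next(PROXIES)   # different exit IP each request
 |       return requests.get(
 |           url,
 |           proxies={"http": proxy, "https": proxy},
 |           timeout=10,
 |       )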
| rfraile wrote:
 | Datadome, PerimeterX, anyone tried one of them?
| IceWreck wrote:
 | Half of the short links to cutt.ly aren't working. Why use short
 | links in markdown?
| yamakadi wrote:
| It's most likely for tracking clicks. Better to just search for
| the company names instead of clicking on the links in case they
| lead to unexpected places.
| rp1 wrote:
| It's very easy to install Chrome on a linux box and launch it
| with a whitelisted extension. You can run Xorg using the dummy
| driver and get a full Chrome instance (i.e. not headless). You
 | can even enable the DevTools API programmatically. I don't see
 | how this would be detectable, and it's probably a lot safer than
 | downloading a random browser package from an unknown developer.
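 |
 | A rough sketch of that setup, assuming Xorg with the dummy driver
 | is already running on display :1 and using Playwright purely as
 | one possible DevTools-protocol client (neither detail comes from
 | the parent comment):
 |
 |   import os
 |   import subprocess
 |   import time
 |   from playwright.sync_api import sync_playwright
 |
 |   # launch a full (non-headless) Chrome against the dummy display
 |   chrome = subprocess.Popen(
 |       [
 |           "google-chrome",
 |           "--remote-debugging-port=9222",
 |           "--load-extension=/path/to/extension",  # hypothetical path
 |       ],
 |       env={**os.environ, "DISPLAY": ":1"},
 |   )
 |   time.sleep(3)   # give Chrome a moment to open the debug port
 |
 |   with sync_playwright() as p:
 |       # attach to the already-running browser over CDP
 |       browser = p.chromium.connect_over_cdp("http://localhost:9222")
 |       page = browser.contexts[0].pages[0]
 |       page.goto("https://example.com")
 |       print(page.title())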
| bsamuels wrote:
| > I need to make a general remark to people who are evaluating
| (and/or) planning to introduce anti-bot software on their
 | websites. Anti-bot software is nonsense. It's snake oil sold to
| people without technical knowledge for heavy bucks.
|
| If this guy got to experience how systemically bad the credential
| stuffing problem is, he'd probably take down the whole
| repository.
|
| None of these anti-bot providers give a shit about invading your
 | privacy, tracking your every movement, or whatever other power
| fantasy that can be imagined. Nobody pays those vendors $10m/year
| to frustrate web crawler enthusiasts, they do it to stop
| credential stuffing.
| melony wrote:
 | The gold standard is a residential IP. It's not cheap, but its
 | effectiveness is indisputable.
| northwest65 wrote:
| Back when we had to scrape airline websites to get the deals
| they withheld for themselves, residential IP was indeed the
 | way. Once they cottoned on to it and blocked it, you'd simply
 | cycle the ADSL modem, get a new IP, and off you'd go again.
|
| Now the best part... one division (big team) of our company
 | worked for the (national carrier) airline, one division of
| our company worked for the resellers (we had a single grad
| allocated to web scraping). The airline threw ridiculous
| dollars at trying to stop it, and we just used a caffeine
| fueled nerd to keep it running. It wasn't all fun though,
| they'd often release their new anti scraping stuff on a
| Friday afternoon. They were less than impressed when they
| learnt who the 'enemy' was. Good times!
| 1cvmask wrote:
| What do you mean by deals withheld for themselves?
| northwest65 wrote:
 | Most flights are available through the airline booking
 | systems such as Sabre. However, airlines might have
 | flights available only on their own website at (sometimes
 | massively) reduced cost, which need to be booked through
 | that site. So the web scraping became two parts: one to
 | provide the data to our search engine to present to our
 | (travel agent) customers. The second part was that we
 | would then book via the airline's website with the
 | details provided by our customer's customer.
| jonatron wrote:
| A residential IP would help for IP based detection. As the
| Readme mentions, there's also Javascript based detection. If,
| for example, your browser has navigator.webdriver set
| incorrectly, then you can still get blocked even on a
| residential IP.
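 |
 | For illustration, the property such scripts read and one common
 | way automation setups mask it, sketched with Playwright's
 | init-script hook (the parent comment doesn't prescribe any
 | particular tool):
 |
 |   from playwright.sync_api import sync_playwright
 |
 |   with sync_playwright() as p:
 |       browser = p.chromium.launch(headless=True)
 |       page = browser.new_page()
 |       # runs before any page script, so detection JS sees the
 |       # patched value instead of headless Chromium's `true`
 |       page.add_init_script(
 |           "Object.defineProperty(navigator, 'webdriver',"
 |           " {get: () => undefined})"
 |       )
 |       page.goto("https://example.com")
 |       print(page.evaluate("navigator.webdriver"))  # now undefined
 |       browser.close()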
| [deleted]
| devit wrote:
 | If users using weak/reused passwords is your problem, just
 | don't let users choose a password (generate it for them), or
 | don't use passwords at all (send a link by e-mail that sets a
 | cookie), or use OAuth login.
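 |
 | A minimal sketch of the e-mail-link variant (the function names
 | and in-memory store are made up for illustration; in practice
 | you'd persist tokens server side):
 |
 |   import secrets
 |   import time
 |
 |   TOKENS = {}   # token -> (email, expiry)
 |
 |   def issue_login_link(email):
 |       token = secrets.token_urlsafe(32)
 |       TOKENS[token] = (email, time.time() + 15 * 60)  # 15 min
 |       # this URL gets e-mailed to the user
 |       return f"https://example.com/login?token={token}"
 |
 |   def redeem(token):
 |       email, expires = TOKENS.pop(token, (None, 0))
 |       if email and time.time() < expires:
 |           return email   # success: set the session cookie
 |       return None        # expired, reused, or unknown token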
| Gigachad wrote:
| 2FA should be a requirement on everything now. And if your site
| can't for some reason or you don't want to deal with it, then
| limit your site to external login providers only.
|
 | 2FA, especially app-based, has been proven to work really,
 | really well.
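 |
 | A small sketch of the app-based (TOTP) flow with the pyotp
 | library, just to show how little server-side code it needs (the
 | helper names and addresses are made up):
 |
 |   import pyotp
 |
 |   def enroll_user():
 |       secret = pyotp.random_base32()   # stored per user
 |       uri = pyotp.TOTP(secret).provisioning_uri(
 |           name="alice@example.com", issuer_name="ExampleSite"
 |       )
 |       return secret, uri               # uri is shown as a QR code
 |
 |   def verify_login(secret, code_from_user):
 |       # valid_window=1 tolerates a little clock drift
 |       return pyotp.TOTP(secret).verify(code_from_user,
 |                                        valid_window=1)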
| oxymoron wrote:
| Yeah, I used to work for one of the major anti-bot vendors.
| Customers weren't clueless. Nobody buys these solutions because
| they're so much fun, it's a cost center and they monitor their
 | ROI quite closely. Credit card chargebacks, impact on
 | infrastructure, extra cost incurred from underlying APIs (as
 | in the airline industry in particular), etc. are all reasons
 | why bot mitigation is a better option than nothing for
| a lot of companies, even if it's not 100% effective.
| al2o3cr wrote:
 | > You use this software at your own risk. Some of them contain
 | malwares just fyi
 |
 | LOL why post LINKS to them then? Flat-out irresponsible...
 |
 | > you build a tool to automate social media accounts to manage ads
 | more efficiently
 |
 | If by "manage" you mean "commit click fraud"
| adinosaur123 wrote:
 | Are there any court cases that set a precedent regarding the
 | legality of web scraping?
 |
 | I'm currently looking for ways to get real estate listings in a
 | particular area, and apparently the only real solution is to
 | scrape the few big online listing sites.
| Grimm1 wrote:
| https://en.m.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
|
 | That's one of the bigger ones. Unfortunately, recent events
 | mean scraping is still a gray area.
| amelius wrote:
| Legal gray areas are perfect for growth hacking. Just look at
| Uber and AirBnb.
| omgwtfbyobbq wrote:
| Do you mean this case?
|
| https://en.m.wikipedia.org/wiki/Van_Buren_v._United_States
|
| I think it only applies to systems that aren't available to
| the general public, which in this case was the GCIC. Anything
 | that is available to the public, even if it requires some
 | sort of registration, would, I think, be legal to scrape. YMMV
| though.
| [deleted]
| adanto6840 wrote:
 | I was involved in a scraping-related case, though in my
 | situation we were scraping public-domain data/facts/media.
 | Email me if you'd like additional info. :)
|
| More related to the submission content -- at the time we used
| rotating proxies, both in-house & external (ProxyMesh - still
| exists & only good things to say about it); they allowed us to
| "pin" multiple requests to an IP or to fetch a new IP, etc...
___________________________________________________________________
(page generated 2021-10-31 23:00 UTC)