[HN Gopher] Creating a serverless function to scrape web pages m...
___________________________________________________________________
Creating a serverless function to scrape web pages metadata
Author : mmazzarolo
Score : 25 points
Date : 2021-06-06 16:44 UTC (6 hours ago)
(HTM) web link (mmazzarolo.com)
(TXT) w3m dump (mmazzarolo.com)
| superasn wrote:
 | One of the biggest challenges I've faced in scraping data has
 | always been that most websites now blacklist almost all
 | datacentre IPs, including entire Amazon and Azure blocks. If
 | you really need to get anything useful out of it, the only way
 | is to use residential IP addresses, which are usually super
 | expensive and often shady (think "an SDK in a mobile game
 | proxying your traffic" shady).
|
 | It almost makes me feel like I'm breaking the law when scraping
 | a site, yet web scraping is one of the most basic programming
 | things.
|
| Just imagine where Google would be if it was a new startup and an
| existing giant like Cloudflare or Cisco blocked all attempts of
| access.
| notsureaboutpg wrote:
| Why can't you use a VPN and route scraping traffic through
| that? Most websites are accessible over VPNs...
| mmazzarolo wrote:
 | > It almost makes me feel like I'm breaking the law when
 | scraping a site, yet web scraping is one of the most basic
 | programming things.
|
| Yeah, same for me.
|
 | Regarding the denylisting, I guess it depends on what is being
 | scraped and how often the scraping happens? I'm maintaining a
 | remote jobs aggregator website and I've never been blocked
 | before (but I'm not scraping the same web page more than ~5
 | times per day). And with a caching strategy, I think that even
 | a scrape-as-a-service API like the one I'm building in the
 | article should be "kinda" safe (besides edge cases that
 | constantly brute-force the cache, e.g. by adding random query
 | params)?
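
A minimal sketch of the caching idea described above, assuming an
in-memory TTL cache keyed by a normalized URL (query string and
fragment dropped, so random query params can't bust the cache);
`fetch_metadata` is a hypothetical stand-in for the real scraper,
and the one-hour TTL is an illustrative choice:

```python
import time
from urllib.parse import urlsplit, urlunsplit


def fetch_metadata(url: str) -> dict:
    # Hypothetical stand-in: the real scraper would fetch the page
    # and parse its <title> / <meta> tags here.
    return {"url": url, "title": "example"}


CACHE_TTL_SECONDS = 60 * 60  # cache each page's metadata for an hour
_cache: dict[str, tuple[float, dict]] = {}


def normalize(url: str) -> str:
    # Drop the query string and fragment so requests that differ only
    # in random query params map to the same cache entry.
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))


def cached_scrape(url: str) -> dict:
    key = normalize(url)
    now = time.monotonic()
    hit = _cache.get(key)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]  # fresh enough: serve from cache, no request made
    metadata = fetch_metadata(key)
    _cache[key] = (now, metadata)
    return metadata
```

With this, `/page?x=1` and `/page?x=2` hit the same cache entry, so
the origin site sees at most one request per page per TTL window.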
| extra88 wrote:
| > most websites are now blacklisting almost all datacentre IPs
| including Amazon, Azure, etc
|
| "Most" sounds like an exaggeration. Wouldn't this also create
| problems for virtual desktop services like Amazon Workspaces?
|
| > It almost makes me feel that I am breaking the law when
| scraping a site
|
| You might be violating their copyright, it depends what you do
| with it. If you overdo it, you could also degrade their service
| for actual users.
| slver wrote:
| > You might be violating their copyright, it depends what you
| do with it.
|
| Google is not new to such complaints (news sites). Everything
| is very relative.
| superasn wrote:
 | A lot of websites are using Cloudflare, which does make
 | scraping quite difficult (just by default, I think).
|
| Spoofing your user agent is a must if you need to do anything
| nowadays.
|
 | To your second point, the same would apply to Google, Bing,
 | and any other search engine. Even if you follow robots.txt and
 | consume equal or less bandwidth, it doesn't matter much if you
 | aren't an established player.
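
Spoofing the user agent, as mentioned above, can be sketched with
Python's standard library alone; the browser-like UA string below is
just an example signature, not anything specific to the article:

```python
import urllib.request

# A browser-like User-Agent string; many sites serve 403s or empty
# pages to library defaults such as "Python-urllib/3.x".
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/91.0.4472.114 Safari/537.36"
)


def make_request(url: str) -> urllib.request.Request:
    # Attach the spoofed User-Agent header to every request we build.
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


# Usage (performs a real network call, so shown commented out):
# with urllib.request.urlopen(make_request("https://example.com")) as resp:
#     html = resp.read()
```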
| slver wrote:
 | Honestly, I'm only starting to accept how stupid it is that we
 | call datacenter services "cloud" now; I just can't bear the
 | stupidity of calling running a script on a server a "serverless
 | function".
___________________________________________________________________
(page generated 2021-06-06 23:02 UTC)