[HN Gopher] Creating a serverless function to scrape web pages metadata
       ___________________________________________________________________
        
       Creating a serverless function to scrape web pages metadata
        
       Author : mmazzarolo
       Score  : 25 points
       Date   : 2021-06-06 16:44 UTC (6 hours ago)
        
 (HTM) web link (mmazzarolo.com)
 (TXT) w3m dump (mmazzarolo.com)
        
       | superasn wrote:
        | One of the biggest challenges I've faced in scraping data has
        | always been that most websites now blacklist almost all
        | datacentre IPs, including the Amazon and Azure blocks. If you
        | really need to get anything useful out of it, the only way is
        | to use residential IP addresses, which are most often super
        | expensive and often shady too (think "an SDK in a mobile game
        | proxying your traffic" shady).
       | 
        | It almost makes me feel that I am breaking the law when scraping
        | a site, yet web scraping is one of the most basic programming
        | things.
       | 
        | Just imagine where Google would be if it were a new startup and
        | an existing giant like Cloudflare or Cisco blocked all its
        | attempts at access.
        
         | notsureaboutpg wrote:
         | Why can't you use a VPN and route scraping traffic through
         | that? Most websites are accessible over VPNs...
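          | 
          | Something like this would do it, I think. A rough sketch
          | assuming node-fetch plus the https-proxy-agent package; the
          | proxy URL is a placeholder for whatever VPN or proxy endpoint
          | you control:
          | 
          |     import fetch from "node-fetch";
          |     import { HttpsProxyAgent } from "https-proxy-agent";
          |     
          |     // Placeholder endpoint: a VPN gateway or paid proxy.
          |     const agent = new HttpsProxyAgent(
          |       "http://user:pass@proxy.example.com:8080"
          |     );
          |     
          |     // Requests go out through the proxy, so the target site
          |     // sees the proxy's IP instead of your datacenter IP.
          |     async function fetchViaProxy(url: string): Promise<string> {
          |       const res = await fetch(url, { agent });
          |       return res.text();
          |     }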
        
         | mmazzarolo wrote:
         | > It almost makes me feel that I am breaking the law when
          | scraping a site, yet web scraping is one of the most basic
         | programming things.
         | 
         | Yeah, same for me.
         | 
          | Regarding the denylisting, I guess it depends on what is being
          | scraped and how often the scraping happens? I'm maintaining a
          | remote jobs aggregator website and I've never been blocked
          | before (but I'm not scraping the same web page more than ~5
          | times per day). And with a caching strategy, I think that even
          | a scrape-as-a-service API like the one I'm building in the
          | article should be "kinda" safe (besides edge cases that brute-
          | force the cache constantly, e.g. by adding random query-
          | params)?
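          | 
          | To make that query-param edge case concrete, here's a rough
          | sketch of the normalization I have in mind (hypothetical code,
          | not the article's implementation; assumes Node 18+ for the
          | global fetch): strip the query string before using the URL as
          | the cache key, so random params can't bust the cache.
          | 
          |     // Naive in-memory cache; in a serverless function it
          |     // only survives while the instance stays warm.
          |     type Entry = { body: string; exp: number };
          |     const cache = new Map<string, Entry>();
          |     const TTL_MS = 60 * 60 * 1000; // keep scrapes for 1 hour
          |     
          |     function cacheKey(rawUrl: string): string {
          |       const url = new URL(rawUrl);
          |       url.hash = "";
          |       url.search = ""; // or allowlist the params that matter
          |       return url.toString();
          |     }
          |     
          |     async function scrapeWithCache(rawUrl: string) {
          |       const key = cacheKey(rawUrl);
          |       const hit = cache.get(key);
          |       if (hit && hit.exp > Date.now()) return hit.body;
          |       const body = await (await fetch(rawUrl)).text();
          |       cache.set(key, { body, exp: Date.now() + TTL_MS });
          |       return body;
          |     }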
        
         | extra88 wrote:
          | > most websites now blacklist almost all datacentre IPs,
          | including the Amazon and Azure blocks
         | 
         | "Most" sounds like an exaggeration. Wouldn't this also create
         | problems for virtual desktop services like Amazon Workspaces?
         | 
         | > It almost makes me feel that I am breaking the law when
         | scraping a site
         | 
          | You might be violating their copyright; it depends on what you
          | do with it. If you overdo it, you could also degrade their
          | service for actual users.
        
           | slver wrote:
            | > You might be violating their copyright; it depends on what
            | you do with it.
            | 
            | Google is no stranger to such complaints (from news sites,
            | for instance). Everything is very relative.
        
           | superasn wrote:
            | A lot of websites are using Cloudflare, which does make
            | scraping quite difficult (just by default, I think).
            | 
            | Spoofing your user agent is a must if you need to do anything
            | nowadays (quick sketch below).
            | 
            | To your second point: the same would apply to Google, Bing,
            | and any other search engine. Even if you follow robots.txt
            | and consume equal or less bandwidth, it doesn't matter much
            | if you aren't an established player.
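            | 
            | A quick sketch of the user-agent spoofing (the UA string is
            | just an example browser signature; assumes Node 18+ for the
            | global fetch):
            | 
            |     // Many WAFs flag default library user agents as bots,
            |     // so send a browser-like one instead.
            |     const BROWSER_UA =
            |       "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
            |       "AppleWebKit/537.36 (KHTML, like Gecko) " +
            |       "Chrome/91.0.4472.114 Safari/537.36";
            |     
            |     async function fetchAsBrowser(url: string) {
            |       const res = await fetch(url, {
            |         headers: { "User-Agent": BROWSER_UA },
            |       });
            |       return res.text();
            |     }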
        
       | slver wrote:
        | Honestly, I'm only now starting to accept how stupid it is that
        | we call datacenter services "cloud", and I just can't bear the
        | stupidity of calling a script running on a server a "serverless
        | function".
        
       ___________________________________________________________________
       (page generated 2021-06-06 23:02 UTC)