[HN Gopher] How to scrape and extract hyperlink networks with Be...
       ___________________________________________________________________
        
       How to scrape and extract hyperlink networks with BeautifulSoup and
       NetworkX
        
       Author : spacejunkjim
       Score  : 46 points
       Date   : 2021-11-15 19:16 UTC (3 hours ago)
        
 (HTM) web link (connectingfigures.com)
 (TXT) w3m dump (connectingfigures.com)
        
       | funnyflamigo wrote:
       | I know some people think all scraping is bad or malicious. I'd
       | like to point out this is a perfectly legitimate use case for it;
       | in fact, this is how Google Search operates.
       | 
       | Web scraping done correctly should be barely noticeable, if at
       | all, to the operators. Don't send 10,000 req/s; use long delays
       | between requests, back off generously before retrying, and try to
       | avoid pages or actions you know are "heavy". You don't need to
       | update data from every product page every 5 minutes.
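A minimal sketch of the "long delays, generous retries" advice above, using only the Python standard library. The function names, the 5-second delay, the retry count, the doubling backoff, and the User-Agent string are all illustrative assumptions, not values from the thread.

```python
import time
import urllib.request


def _urllib_fetch(url):
    # Hypothetical User-Agent; identify your crawler and give a contact.
    req = urllib.request.Request(
        url, headers={"User-Agent": "example-crawler/0.1 (you@example.com)"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def polite_get(url, fetch=_urllib_fetch, sleep=time.sleep,
               delay=5.0, max_retries=4, backoff=2.0):
    """Fetch one URL slowly, waiting longer after each failure.

    `fetch` and `sleep` are injectable so the pacing logic can be
    exercised without touching the network.
    """
    wait = delay
    for attempt in range(max_retries + 1):
        sleep(wait)  # pause before every request, including the first
        try:
            return fetch(url)
        except OSError:
            # URLError, HTTPError and socket timeouts are OSError subclasses.
            if attempt == max_retries:
                raise  # out of retries: let the caller see the error
            wait *= backoff  # back off further before the next attempt
```

Calling polite_get(url) once per page in a plain loop keeps the crawl sequential, so the target site never sees more than one request in flight from this process.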
        
         | tyingq wrote:
         | My guess is that scraping is getting heavier because scrapers
         | have to use headless browsers now, so they are probably
         | downloading artifacts they don't need, because they can't tell
         | what's needed or not, at least with JS.
        
       | rabuse wrote:
       | This breaks when you use standard web scraping methods (plain
       | fetches with no headless JS engine). I had to deal with this issue
       | recently because everything is a damn SPA now. Look into Selenium
       | for running headless browsers if you want to scrape the modern web.
        
         | kordlessagain wrote:
         | Depending on the use case, you might try imaging the page, then
         | sending the image to an ML model to get the full text before
         | indexing. If you need links extracted, Selenium also supports
         | parsing the assembled DOM:
         | https://github.com/kordless/grub-2.0/tree/main/aperture
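A rough sketch of the headless-browser approach discussed above: render the page in headless Chrome via Selenium, then pull hyperlinks out of the assembled DOM. It assumes the selenium package and a Chrome install are available. The helper names are hypothetical, and the link extraction uses the standard library's HTMLParser as a stand-in rather than the BeautifulSoup code from the linked article.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links in `html`, in document order."""
    collector = _LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


def render_page(url):
    """Return the page source after the browser has run the page's JS."""
    # Imported here so extract_links works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Something like extract_links(render_page(url), url) would then give the outgoing links for one page, and each (url, link) pair could serve as a directed edge in a NetworkX graph. Pages that load content after the initial load event may need an explicit wait before reading page_source.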
        
         | ricardo81 wrote:
         | I used to use mozrepl back in the day, before FF Quantum (or a
         | version nearby) broke it. It worked great: you ran the browser(s)
         | via telnet and could run as many as memory would allow.
        
       ___________________________________________________________________
       (page generated 2021-11-15 23:02 UTC)