[HN Gopher] How to scrape and extract hyperlink networks with BeautifulSoup and NetworkX
___________________________________________________________________
How to scrape and extract hyperlink networks with BeautifulSoup and
NetworkX
Author : spacejunkjim
Score : 46 points
Date : 2021-11-15 19:16 UTC (3 hours ago)
(HTM) web link (connectingfigures.com)
(TXT) w3m dump (connectingfigures.com)
| funnyflamigo wrote:
  | I know some people think all scraping is bad or malicious. I'd
  | like to point out that this is a perfectly legitimate use case
  | for it; in fact, it's essentially how Google Search operates.
  |
  | Web scraping done correctly should be barely noticeable to the
  | site's operators, if noticeable at all. Don't send 10,000 req/s:
  | add generous delays between requests, back off aggressively on
  | retries, and avoid pages or actions you know are "heavy". You
  | don't need to re-fetch data from every product page every 5
  | minutes.
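  |
  | Concretely, a minimal sketch of "polite" fetching (the delay,
  | retry cap, and User-Agent string are illustrative assumptions,
  | not prescriptions):
  |
  |     import time
  |     import requests
  |
  |     def polite_get(url, session, retries=3, delay=5.0):
  |         # Wait before *every* request, and wait longer on each
  |         # retry so a struggling server gets breathing room.
  |         for attempt in range(retries):
  |             time.sleep(delay * (attempt + 1))
  |             try:
  |                 resp = session.get(
  |                     url, timeout=30,
  |                     headers={"User-Agent": "research-bot/0.1"})
  |             except requests.RequestException:
  |                 continue  # network hiccup: back off and retry
  |             if resp.status_code == 200:
  |                 return resp
  |             if resp.status_code not in (429, 503):
  |                 return None  # hard failure; don't hammer it
  |             # 429/503 means "slow down": loop and back off
  |         return None
  |
  |     session = requests.Session()
  |     page = polite_get("https://example.com/some-page", session)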
| tyingq wrote:
  | My guess is that scraping is getting heavier because scrapers
  | now have to use headless browsers, and so they end up
  | downloading artifacts they don't need: with JS in the mix, they
  | can't tell in advance which resources are actually required.
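  |
  | One partial mitigation, sketched here assuming Selenium with
  | headless Chrome: you can at least tell the browser not to fetch
  | images, even if you can't know which scripts a page needs.
  |
  |     from selenium import webdriver
  |     from selenium.webdriver.chrome.options import Options
  |
  |     options = Options()
  |     options.add_argument("--headless")
  |     # Chrome preference: 2 = block image loading entirely. JS
  |     # still runs, so the DOM assembles, but the heaviest
  |     # static artifacts are skipped.
  |     options.add_experimental_option(
  |         "prefs",
  |         {"profile.managed_default_content_settings.images": 2})
  |
  |     driver = webdriver.Chrome(options=options)
  |     driver.get("https://example.com")
  |     html = driver.page_source
  |     driver.quit()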
| rabuse wrote:
  | This breaks down with standard web scraping methods (plain HTTP
  | fetches, no headless JS engine). I had to deal with the issue
  | recently because everything is a damn SPA now. Look into
  | Selenium for running headless browsers if you're looking to
  | scrape the modern web.
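  |
  | A minimal sketch of that setup (headless Firefox via
  | geckodriver; the URL and wait condition are placeholders):
  |
  |     from bs4 import BeautifulSoup
  |     from selenium import webdriver
  |     from selenium.webdriver.common.by import By
  |     from selenium.webdriver.firefox.options import Options
  |     from selenium.webdriver.support.ui import WebDriverWait
  |     from selenium.webdriver.support import (
  |         expected_conditions as EC)
  |
  |     options = Options()
  |     options.add_argument("--headless")
  |     driver = webdriver.Firefox(options=options)
  |     driver.get("https://example.com/spa-page")
  |     # Wait for the SPA to render real content instead of
  |     # parsing the near-empty HTML shell it serves initially.
  |     WebDriverWait(driver, 10).until(
  |         EC.presence_of_element_located((By.TAG_NAME, "a")))
  |     # The rendered source can now go to BeautifulSoup just
  |     # like a static page fetched with requests.
  |     soup = BeautifulSoup(driver.page_source, "html.parser")
  |     driver.quit()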
| kordlessagain wrote:
  | Depending on the use case, you might try rendering the page to
  | an image, then sending the image to an ML model for full-text
  | extraction before indexing. If you need links extracted,
  | Selenium also supports parsing the assembled DOM:
  | https://github.com/kordless/grub-2.0/tree/main/aperture
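  |
  | For the link-extraction path, a rough sketch (and since the
  | article's goal is a hyperlink network, the hrefs can feed
  | straight into NetworkX; the URL here is illustrative):
  |
  |     import networkx as nx
  |     from selenium import webdriver
  |     from selenium.webdriver.common.by import By
  |
  |     url = "https://example.com"
  |     driver = webdriver.Chrome()
  |     driver.get(url)
  |     # find_elements queries the assembled, post-JS DOM, so
  |     # links injected by scripts are included, unlike parsing
  |     # the raw HTML source.
  |     hrefs = [a.get_attribute("href")
  |              for a in driver.find_elements(By.TAG_NAME, "a")]
  |     driver.quit()
  |
  |     # One node per page, one directed edge per hyperlink.
  |     G = nx.DiGraph()
  |     G.add_edges_from((url, h) for h in hrefs if h)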
| ricardo81 wrote:
  | I used to use mozrepl back in the day, before FF Quantum (or a
  | version around then) broke it. It worked great: you drove the
  | browser(s) over telnet, and could run as many browsers as
  | memory would allow.
___________________________________________________________________
(page generated 2021-11-15 23:02 UTC)