[HN Gopher] Katana: A crawling and spidering framework
___________________________________________________________________
Katana: A crawling and spidering framework
Author : feross
Score : 81 points
Date : 2022-11-08 20:03 UTC (2 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jmatthews wrote:
| Was there any specific library this was inspired by, or a
| specific use case it was built for besides the obvious generic
| case?
| niteshade wrote:
| For "a next-generation crawling and spidering framework", it's a
| little surprising to see no support for the WARC[1] format.
|
| [1]: https://en.wikipedia.org/wiki/Web_ARChive
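|
| For readers unfamiliar with WARC, here is a minimal sketch of
| what one record looks like on the wire, assuming the WARC/1.0
| layout (version line, named header fields, blank line, payload,
| two CRLFs). The function name warc_resource_record is
| hypothetical and is not part of Katana or any WARC library; a
| real crawler would more likely use an existing WARC writer.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(uri: str, payload: bytes,
                         content_type: str = "text/html") -> bytes:
    """Serialize one WARC/1.0 'resource' record (illustrative helper only)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date",
         datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", uri),
        ("Content-Type", content_type),
        # Content-Length counts the payload bytes only.
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in headers)
    head += "\r\n"  # blank line separates headers from the payload
    # Every record ends with two CRLFs after the payload.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

| Concatenating such records (optionally gzipped per record) is
| what produces a .warc file that archive tooling can replay.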
| pr337h4m wrote:
| Wonder why the Internet Archive never tried to build a web
| search engine - their crawls of the entire web could be more
| comprehensive than Google (assuming Google doesn't archive old
| copies of websites)
| graypegg wrote:
| That's both really intriguing, and horrifying!
|
| It's already _technically_ impossible to erase something from
| the internet, but if they removed the barrier to knowing
| where something was before in order to find it in the
| archive, it would be truly impossible in every sense of the
| word.
| dredmorbius wrote:
| Brewster Kahle, the IA's founder, did. It is called Alexa
| Internet, and was sold to Amazon:
|
| <https://en.wikipedia.org/wiki/Alexa_Internet>
|
| <https://help.archive.org/help/wayback-machine-general-
| inform...>
|
| A condition of that sale was that Alexa would continue to
| provide the results of its crawls, after a delay, to the
| Internet Archive. Those crawls form a substantial portion of
| IA's Wayback Machine archive.
|
| I'm _not_ certain that those archives are ongoing, as Alexa
| seems to have been largely shut down.
|
| IA are a bit cagey on details, but I believe that there is a
| general IA-based archival service. There's certainly the
| "Save Page Now" feature:
| https://web.archive.org/save/<URL>
|
| And the independent but closely-cooperating ArchiveTeam (led
| by Jason Scott) tailors crawlers to specific endangered /
| vulnerable websites, run through its Warrior software:
|
| <https://wiki.archiveteam.org/>
| [deleted]
| ddorian43 wrote:
| Crawling should be the easiest part.
| mrkeen wrote:
| I wrote a crawler a few years ago. I fired it up recently but had
| little luck in fetching pages. It looked like Cloudflare was
| protecting the site from me.
|
| Are any of you other HNers finding the web increasingly difficult
| to scrape from?
| adamredwoods wrote:
| Is it a 301 that keeps looping? I noticed my company's website
| does that when I try to cURL, and I wondered if it was cookie
| based, or how to get around it.
|
| (EDIT: Yes, I figured it out. To get around the 301 loop, cURL
| needs to save cookies.)
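|
| A minimal Python sketch of the same fix, assuming the loop is
| caused by the server setting a cookie on the 301 and expecting
| it back on the next request. It uses only the standard library;
| make_cookie_opener and fetch are hypothetical helper names.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Build an opener that stores cookies and resends them on redirects."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch(url: str, timeout: float = 10.0):
    """Fetch url, following redirects while keeping cookies between hops."""
    opener, jar = make_cookie_opener()
    with opener.open(url, timeout=timeout) as resp:
        return resp.status, resp.read(), jar
```

| With cURL itself, the equivalent is to turn on its cookie
| engine while following redirects, e.g. -L together with -c and
| -b pointing at the same cookie jar file.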
| bjord wrote:
| If you're scraping with Python, try cloudscraper--among other
| things(!), it handles Cloudflare's JavaScript challenge
| (basically the bare-minimum check Cloudflare does), without
| needing to run a full browser in the background. It's built on
| requests, so integration (for me, anyway) was pretty easy.
|
| https://github.com/venomous/cloudscraper
| datalopers wrote:
| How does this handle headless? Does it just come with a baked
| chrome binary?
| thatwasunusual wrote:
| https://github.com/projectdiscovery/katana#headless-mode
| rouxz wrote:
| What is "next gen" in this implementation? Chrome support?
|
| IMO the hardest things in distributed crawling at scale are a
| good URL frontier, priorities, rate limiting and things like
| that, which are quite often overlooked.
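|
| To make those three pieces concrete, here is a single-process
| sketch of a URL frontier with priorities, de-duplication and a
| per-host politeness delay. It is an illustration only, not how
| Katana works; the Frontier class, its method names and the
| one-second default delay are all assumptions. A distributed
| crawler would also need to shard this state across workers.

```python
import heapq
import itertools
from urllib.parse import urlsplit


class Frontier:
    """Priority queue of URLs with de-duplication and per-host delay."""

    def __init__(self, per_host_delay: float = 1.0):
        self.per_host_delay = per_host_delay
        self._heap = []                 # entries: (priority, seq, url)
        self._seen = set()              # URLs ever enqueued
        self._next_ok = {}              # host -> earliest allowed fetch time
        self._seq = itertools.count()   # tie-breaker keeps FIFO order

    def add(self, url: str, priority: int = 0) -> bool:
        """Enqueue url (lower priority value is fetched first)."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._seq), url))
        return True

    def pop(self, now: float):
        """Return the best URL whose host may be fetched at `now`, else None."""
        deferred, result = [], None
        while self._heap:
            item = heapq.heappop(self._heap)
            host = urlsplit(item[2]).netloc
            if self._next_ok.get(host, 0.0) <= now:
                # Reserve the host until the politeness delay has passed.
                self._next_ok[host] = now + self.per_host_delay
                result = item[2]
                break
            deferred.append(item)       # host is rate limited, try later
        for item in deferred:
            heapq.heappush(self._heap, item)
        return result
```

| In a real crawl loop, `now` would come from time.monotonic();
| passing it in explicitly keeps the sketch deterministic.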
| dang wrote:
| We took 'next gen' out of the title since it's borderline
| clickbaity and tends to be a distraction.
| 1vuio0pswjnm7 wrote:
| Would be nice if HN could remove clickbait terms like "modern",
| "next-generation", "blazingly fast", etc. These
| characterisations only look dated if not silly when we look at
| them years down the road.
| yourapostasy wrote:
| In jest: I'd give an allowance to any product that steps
| right on the boundary of what we currently know as the
| fundamental limits of physics. Like Shannon entropy for a
| compression implementation. Or Planck length for processors.
| staplung wrote:
| Heheh.
|
| (Planck length is about 1.6e-35 m. Even the strong nuclear
| force operates on a scale that's like 20 orders of magnitude
| larger (about 1e-15 m). And a hugantuan electron? Forget
| about it.)
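|
| A quick check of that gap, assuming a Planck length of about
| 1.616e-35 m and a strong-force range of about 1e-15 m (both
| rounded values; orders_of_magnitude is a hypothetical helper):

```python
import math


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Base-10 logarithm of the ratio between two positive lengths."""
    return math.log10(larger / smaller)


PLANCK_LENGTH_M = 1.616e-35      # assumed rounded value
STRONG_FORCE_RANGE_M = 1e-15     # assumed rounded value

gap = orders_of_magnitude(STRONG_FORCE_RANGE_M, PLANCK_LENGTH_M)
print(f"{gap:.1f} orders of magnitude")
```

| That comes out just under 20, so "like 20 orders of magnitude"
| holds up.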
| [deleted]
| sidpatil wrote:
| I wish all technology naming would follow this rule.
|
| Fast Ethernet is my favorite example.
| harryvederci wrote:
| Reminds me of "The New Cook Book"[0] in the kitchen of a
| family member. That book is older than I am.
|
| [0] (translated title)
___________________________________________________________________
(page generated 2022-11-10 23:01 UTC)