[HN Gopher] Katana: A crawling and spidering framework
       ___________________________________________________________________
        
       Katana: A crawling and spidering framework
        
       Author : feross
       Score  : 81 points
       Date   : 2022-11-08 20:03 UTC (2 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jmatthews wrote:
       | Was there any specific library this was inspired by, or a
       | specific use case it was built for besides the obvious generic
       | case?
        
       | niteshade wrote:
       | For "a next-generation crawling and spidering framework", it's a
       | little surprising to see no support for the WARC[1] format.
       | 
       | [1]: https://en.wikipedia.org/wiki/Web_ARChive
        
         | pr337h4m wrote:
         | Wonder why the Internet Archive never tried to build a web
         | search engine - their crawls of the entire web could be more
         | comprehensive than Google (assuming Google doesn't archive old
         | copies of websites)
        
           | graypegg wrote:
           | That's both really intriguing, and horrifying!
           | 
            | It's already _technically_ impossible to erase something from
            | the internet, but if they removed the barrier of needing to
            | know where something was before in order to find it in the
            | archive, erasure would become truly impossible in every sense
            | of the word.
        
           | dredmorbius wrote:
           | Brewster Kahle, the IA's founder, did. It is called Alexa
           | Internet, and was sold to Amazon:
           | 
           | <https://en.wikipedia.org/wiki/Alexa_Internet>
           | 
           | <https://help.archive.org/help/wayback-machine-general-
           | inform...>
           | 
           | A condition of that sale was that Alexa would continue to
           | provide the results of its crawls, after a delay, to the
           | Internet Archive. Those crawls form a substantial portion of
           | IA's Wayback Machine archive.
           | 
            | I'm _not_ certain that those archives are ongoing, as Alexa
            | seems to have been largely shut down.
           | 
           | IA are a bit cagey on details, but I believe that there is a
           | general IA-based archival service. There's certainly the
           | "Save Page Now" feature:
           | https://web.archive.org/save/<URL>
           | 
            | And the independent but closely-cooperating ArchiveTeam (led
            | by Jason Scott) tailors crawlers to endangered / vulnerable
            | websites via its Warrior software:
           | 
           | <https://wiki.archiveteam.org/>
        
           | [deleted]
        
           | ddorian43 wrote:
           | Crawling should be the easiest part.
        
       | mrkeen wrote:
       | I wrote a crawler a few years ago. I fired it up recently but had
       | little luck in fetching pages. It looked like cloudflare was
       | protecting the site from me.
       | 
       | Are any of you other HNers finding the web increasingly difficult
       | to scrape from?
        
         | adamredwoods wrote:
         | Is it a 301 that keeps looping? I noticed my company's website
         | does that when I try to cURL, and I wondered if it was cookie
         | based, or how to get around it.
         | 
          | (EDIT: Yes, I figured it out. To get around the 301 loop,
          | cURL needs to save cookies.)
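The cookie-gated redirect loop described above can be reproduced locally with the Python standard library: a toy server that 301-redirects any request lacking a cookie, and a client with a cookie jar (the equivalent of curl's `-c`/`-b` flags) that escapes the loop. The server and all names here are illustrative, not part of any real site.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class CookieGate(BaseHTTPRequestHandler):
    def do_GET(self):
        if "session=ok" in self.headers.get("Cookie", ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"welcome")
        else:
            # No cookie yet: set one and redirect back to the same URL.
            # A cookie-less client loops here forever.
            self.send_response(301)
            self.send_header("Set-Cookie", "session=ok")
            self.send_header("Location", "/")
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), CookieGate)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# A cookie jar (curl's -c/-b equivalent) stores the Set-Cookie from
# the 301 and replays it on the redirected request, ending the loop.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
body = opener.open(url).read()
print(body)  # b'welcome'
server.shutdown()
```

The same fix in curl terms is `-L` to follow redirects plus `-c cookies.txt -b cookies.txt` to persist cookies between hops.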
        
         | bjord wrote:
         | If you're scraping with Python, try cloudscraper--among other
         | things(!), it supports JS rendering (basically the bare-minimum
         | check cloudflare does), without needing to run a full browser
         | in the background. It's built on requests, so integration (for
         | me, anyway) was pretty easy.
         | 
         | https://github.com/venomous/cloudscraper
        
       | datalopers wrote:
       | How does this handle headless? Does it just come with a baked
       | chrome binary?
        
         | thatwasunusual wrote:
         | https://github.com/projectdiscovery/katana#headless-mode
        
       | rouxz wrote:
       | What is "next gen" in this implementation? Chrome support?
       | 
       | IMO the hardest things in distributed crawling at scale are a
       | good URL frontier, priorities, rate limiting and things like
       | that, which are quite often overlooked.
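To make the parent comment's point concrete, here is a toy sketch (not Katana's implementation; all names are invented) of a URL frontier that combines a priority queue with per-host rate limiting, two of the concerns mentioned above:

```python
import heapq
import time
from urllib.parse import urlparse

class Frontier:
    """Priority-ordered URL queue with per-host politeness delays."""

    def __init__(self, per_host_delay=1.0):
        self.delay = per_host_delay
        self.heap = []      # (priority, seq, url); seq keeps FIFO order on ties
        self.seq = 0
        self.next_ok = {}   # host -> earliest allowed next fetch time
        self.seen = set()   # dedupe already-enqueued URLs

    def add(self, url, priority=0):
        if url not in self.seen:
            self.seen.add(url)
            heapq.heappush(self.heap, (priority, self.seq, url))
            self.seq += 1

    def pop(self, now=None):
        """Return the best URL whose host is not rate-limited, or None."""
        now = time.monotonic() if now is None else now
        deferred, url = [], None
        while self.heap:
            prio, seq, candidate = heapq.heappop(self.heap)
            host = urlparse(candidate).netloc
            if self.next_ok.get(host, 0) <= now:
                self.next_ok[host] = now + self.delay
                url = candidate
                break
            deferred.append((prio, seq, candidate))
        for item in deferred:           # requeue rate-limited URLs
            heapq.heappush(self.heap, item)
        return url

f = Frontier(per_host_delay=1.0)
f.add("https://a.example/1", priority=0)
f.add("https://a.example/2", priority=0)
f.add("https://b.example/1", priority=1)
print(f.pop(now=0.0))  # https://a.example/1  (best priority)
print(f.pop(now=0.0))  # https://b.example/1  (a.example is rate-limited)
print(f.pop(now=0.0))  # None (both hosts rate-limited)
print(f.pop(now=1.0))  # https://a.example/2  (delay has elapsed)
```

Production frontiers (e.g. the Mercator design) add much more: disk-backed queues, per-host front/back queue separation, robots.txt handling, and priority refresh policies.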
        
         | dang wrote:
         | We took 'next gen' out of the title since it's borderline
         | clickbaity and tends to be a distraction.
        
         | 1vuio0pswjnm7 wrote:
          | Would be nice if HN could remove clickbait terms like "modern",
          | "next-generation", "blazingly fast", etc. These
          | characterisations only look dated, if not silly, when we look
          | at them years down the road.
        
           | yourapostasy wrote:
           | In jest: I'd give an allowance to any product that steps
           | right on the boundary of what we currently know as the
           | fundamental limits of physics. Like Shannon entropy for a
           | compression implementation. Or Planck length for processors.
        
             | staplung wrote:
             | Heheh.
             | 
              | (Planck length is about 10^-35 m. Even the strong nuclear
              | force operates on a scale that's like 20 orders of
              | magnitude larger (10^-15 m). And a hugantuan electron?
              | Forget about it.)
        
           | [deleted]
        
           | sidpatil wrote:
           | I wish all technology naming would follow this rule.
           | 
           | Fast Ethernet is my favorite example.
        
           | harryvederci wrote:
           | Reminds me of "The New Cook Book"[0] in the kitchen of a
           | family member. That book is older than I am.
           | 
           | [0] (translated title)
        
       ___________________________________________________________________
       (page generated 2022-11-10 23:01 UTC)