[HN Gopher] Show HN: Crawl a modern website to a zip, serve the ...
       ___________________________________________________________________
        
       Show HN: Crawl a modern website to a zip, serve the website from
       the zip
        
       Author : unlog
       Score  : 87 points
       Date   : 2024-06-10 11:47 UTC (11 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | unlog wrote:
        | I'm a big fan of modern JavaScript frameworks, but I don't fancy
        | SSR, so I have been experimenting with crawling my own sites for
        | uploading to hosts without having to do SSR. This is the result.
        
       | ryanwaldorf wrote:
       | What's the benefit of this approach?
        
         | unlog wrote:
          | That the page HTML is indexable by search engines without
          | having to render it on the server. You just unzip into a
          | directory served by nginx. You may also use it for archiving
          | purposes, or for keeping backups.
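          | 
          | For the "serve from the zip" part, a minimal sketch of the
          | idea in Node (using the adm-zip package purely as an
          | illustration; not necessarily what this project uses):
          | 
          |   import http from "node:http";
          |   import AdmZip from "adm-zip";
          | 
          |   const zip = new AdmZip("site.zip"); // hypothetical output
          | 
          |   http.createServer((req, res) => {
          |     let name = (req.url ?? "/").slice(1);
          |     if (name === "") name = "index.html";
          |     const entry = zip.getEntry(name);
          |     if (!entry) { res.writeHead(404); res.end(); return; }
          |     // bytes come straight out of the archive, no unzip step
          |     // (a real server would also set a Content-Type here)
          |     res.end(entry.getData());
          |   }).listen(8080);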
        
         | brudgers wrote:
          | One _possible_ advantage _I_ see is that it creates a 1:1
         | correspondence between a website and a file.
         | 
         | If what I care about is the website (and that's usually going
         | to be the case), then there's a single familiar box containing
         | all the messy details. I don't have to see all the files I want
         | to ignore.
         | 
          | That might not be a benefit for you, and not having used it, it
         | is only a theoretical benefit in an unlikely future for me.
         | 
         | But just from the title of the post, I had a very clear
          | picture of the mechanism and it was not obvious why I would
         | want to _start_ with a different mechanism (barring ordinary
         | issues with open source projects).
         | 
         | But that's me and your mileage may vary.
        
       | earleybird wrote:
        | Understood that it's early days; are you considering a licence
        | to release it under?
        
         | unlog wrote:
         | Sure, I forgot about that detail, what license do you suggest?
        
           | yodon wrote:
           | MIT and BSD seem to be by far the most common these days (I
           | generally do MIT personally)
        
             | unlog wrote:
             | added
        
           | meiraleal wrote:
           | AGPL
        
         | yodon wrote:
          | +1, can you add a license file with an MIT or BSD or whatever
         | your preference is? (It's very cool. I'd love to help with this
         | project, I'm guessing others would as well)
        
       | jayemar wrote:
       | Is the output similar to a web archive file (warc)?
        
         | unlog wrote:
         | That's something I haven't explored, sounds interesting. Right
         | now, the zip file contains a mirror of the files found on the
          | website when loaded in a browser. I ended up with a zip file by
         | luck, as mirroring to the file system gives predictable
         | problems with file/folder names.
        
           | toomuchtodo wrote:
           | https://news.ycombinator.com/item?id=40628958
           | 
           | https://github.com/internetarchive/warctools
        
       | jll29 wrote:
        | Microsoft Internet Explorer (no, I'm not using it personally) had
        | a file format called *.mht that could save an HTML page together
       | with all the files referenced from it like inline images. I
       | believe you could not store more than one page in one *.mht file,
       | though, so your work could be seen as an extension.
       | 
       | Although UNIX philosophy posits that it's good to have many small
        | files, I like your idea for its contribution to reducing clutter
       | (imagine running 'tree' in both scenarios) and also avoiding
       | running out of inodes in some file systems (maybe less of a
       | problem nowadays in general, not sure as I haven't generated
       | millions of tiny files recently).
        
         | jdougan wrote:
          | .mht is alive and well. It is a MIME wrapper on the files and
         | is generated by Chrome, Opera, and Edge's save option "Webpage
         | as single file" and defaults to an extension of .mhtml.
         | 
         | When I last looked Firefox didn't support it natively but it
         | was a requested feature.
        
           | rrr_oh_man wrote:
            | _> When I last looked Firefox didn't support it natively but
           | it was a requested feature._
           | 
           | That sounds familiar, unfortunately
        
         | unlog wrote:
          | Yes! You know, I was considering this over the previous couple
          | of days and was looking around for how to construct an `mhtml`
          | file that serves all the files at the same time. Unrelated to
          | this project, I had the use case of a client wanting to keep an
          | offline version of one of my projects.
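          | 
          | As a rough sketch of what hand-building one could look like
          | (MHTML is a multipart/related MIME message; the part list and
          | URLs below are made up for illustration):
          | 
          |   const boundary = "----=_mhtml_boundary";
          | 
          |   // one MIME part per crawled file, keyed by its original URL
          |   function part(type: string, url: string, body: string) {
          |     return [
          |       `--${boundary}`,
          |       `Content-Type: ${type}`,
          |       "Content-Transfer-Encoding: base64",
          |       `Content-Location: ${url}`,
          |       "",
          |       Buffer.from(body).toString("base64"),
          |       "",
          |     ].join("\r\n");
          |   }
          | 
          |   const mhtml = [
          |     "MIME-Version: 1.0",
          |     'Content-Type: multipart/related; type="text/html";',
          |     ` boundary="${boundary}"`,
          |     "",
          |     part("text/html", "https://example.com/", "<h1>hi</h1>"),
          |     // ...one part() per stylesheet, script, image, etc.
          |     `--${boundary}--`,
          |   ].join("\r\n");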
         | 
         | > Although UNIX philosophy posits that it's good to have many
          | small files, I like your idea for its contribution to reducing
         | clutter (imagine running 'tree' in both scenarios) and also
         | avoiding running out of inodes in some file systems (maybe less
         | of a problem nowadays in general, not sure as I haven't
         | generated millions of tiny files recently).
         | 
          | Pretty rare for any website to have many files, as they
          | optimize to have as few files as possible (fewer network
          | requests, since many small requests can be slower than just
          | shipping a big file). I have crawled the React docs as a test,
          | and it's a zip file of 147 MB with 3,803 files (including
          | external resources).
          | 
          | https://docs.solidjs.com/ is 12 MB (including external
          | resources) with 646 files.
        
       | kitd wrote:
       | Nice work!
       | 
       | Obligatory mention for RedBean, the server that you can package
       | along with all assets (incl db, scripting and TLS support) into a
       | single multi-platform binary.
       | 
       | https://redbean.dev/
        
       | tamimio wrote:
        | How is it different from HTTrack? And what about media
        | extensions, which ones are supported and which aren't? Sometimes
        | when I download some sites with HTTrack, some files just get
        | ignored because by default it looks only for default types, and
        | you have to add them manually.
        
         | unlog wrote:
          | Big fan of HTTrack! It reminds me of the old days and makes me
          | sad about the current state of the web.
          | 
          | I am not sure if HTTrack has progressed beyond fetching
          | resources, it's been a long time since I last used it, but what
          | my project does is spin up a real web browser (Chrome in
          | headless mode, which means it's hidden) and then let the
          | JavaScript on that website execute, which means it will
          | display/generate some fancy HTML that you can then save as-is
          | into an index.html. It saves all kinds of files; it doesn't
          | care about the extension or mime type, it tries to save them
          | all.
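          | 
          | Roughly, a minimal sketch of that mechanism with Puppeteer
          | (the real project also does link discovery, URL rewriting,
          | checkpoints and zipping; the target URL here is made up):
          | 
          |   import puppeteer from "puppeteer";
          |   import fs from "node:fs/promises";
          | 
          |   const url = "https://example.com/"; // hypothetical target
          |   const browser = await puppeteer.launch({ headless: true });
          |   const page = await browser.newPage();
          | 
          |   // keep every response the page triggers (js, css, media...)
          |   const saved: { url: string; body: Buffer }[] = [];
          |   page.on("response", async (res) => {
          |     try {
          |       const body = await res.buffer();
          |       saved.push({ url: res.url(), body });
          |     } catch {
          |       // redirects and some preflights have no body
          |     }
          |   });
          | 
          |   await page.goto(url, { waitUntil: "networkidle0" });
          |   // content() is the DOM *after* client-side JS has run
          |   await fs.writeFile("index.html", await page.content());
          |   await browser.close();
          |   // a real crawler would now write `saved` into the zip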
        
           | tamimio wrote:
            | > It saves all kinds of files; it doesn't care about the
            | extension or mime type, it tries to save them all.
           | 
            | That's awesome to know, I will give it a try. One website I
            | remember trying to download has all sorts of animations with
            | a .riv extension, and it didn't work well with HTTrack. I
            | will try it with this soon, thanks for sharing it!
        
       | renegat0x0 wrote:
       | My 5 cents:
       | 
       | - status codes 200-299 are all OK
       | 
       | - status codes 300-399 are redirects, and also can be OK
       | eventually
       | 
        | - 403 in my experience occurs quite often, where it is not an
        | error but a suggestion that your user agent is not OK
       | 
        | - robots.txt should be scanned to check if any resource is
        | prohibited, or if there are speed requirements. It is always
        | better to be _nice_. I plan to add something like that; it is
        | also missing in my project
       | 
        | - It would be interesting to generate a hash from the app and
        | update only if the hash is different? (rough sketch below)
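        | 
        | For that last point, something along these lines (plain Node
        | crypto; the file names are illustrative, not from this project):
        | 
        |   import { createHash } from "node:crypto";
        |   import fs from "node:fs/promises";
        | 
        |   async function sha256Of(path: string) {
        |     const data = await fs.readFile(path);
        |     return createHash("sha256").update(data).digest("hex");
        |   }
        | 
        |   // only swap in the new crawl when its content changed
        |   const fresh = await sha256Of("crawl-new.zip");
        |   const live = await sha256Of("crawl-live.zip").catch(() => "");
        |   if (fresh !== live) {
        |     await fs.copyFile("crawl-new.zip", "crawl-live.zip");
        |   }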
        
         | unlog wrote:
          | Status codes: I am displaying the list because on a
          | JavaScript-driven application you mostly don't want codes other
          | than 200 (besides media).
          | 
          | I thought about robots.txt, but as this is software that you
          | are supposed to run against your own website I didn't consider
          | it worthwhile. You have a point on speed requirements and
          | prohibited resources (but it's not like skipping over them will
          | add any security).
          | 
          | I haven't put much time/effort into an update step. Currently,
          | it resumes if the process exited, via checkpoints (it saves the
          | current state every 250 URLs; if any URL is missing it can
          | continue, otherwise it is done).
         | 
         | Thanks, btw what's your project!? Share!
        
       | meiraleal wrote:
       | Seems like a very useful tool to impersonate websites. Useful to
       | scammers. Why would someone crawl their own website?
        
         | kej wrote:
         | Scammers don't need this to copy an existing website, and I
         | could see plenty of legitimate uses. Maybe you're redoing the
         | website but want to keep the previous site around somewhere, or
         | you want an easy way to archive a site for future reference.
         | Maybe you're tired of paying for some hosted CMS but you want
         | to keep the content.
        
           | meiraleal wrote:
           | All the scenarios you described can be achieved by having
           | access to the source code, assuming you own it.
        
             | kej wrote:
             | Lots of things are possible with access to source code that
             | are still easier when someone writes a tool for that
             | scenario.
        
               | meiraleal wrote:
               | Crawling a build you already have isn't one of them
        
               | p4bl0 wrote:
                | The website in question may be a dynamic website (e.g.,
                | WordPress, MediaWiki, or whatever other CMS or custom web
                | app) and you either want a snapshot of it for backup, or
                | you run it locally and want a static copy to host it
                | somewhere that only supports static files.
        
         | unlog wrote:
         | > Why would someone crawl their own website?
         | 
          | My main use case is the docs site https://pota.quack.uy/ ,
          | which Google cannot index properly. On
          | https://www.google.com/search?q=site%3Apota.quack.uy you will
          | see that some titles/descriptions don't match what the content
          | of the page is about. As the full site is rendered client side,
          | via JavaScript, I can just crawl it myself and save the HTML
          | output to actual files. Then I can serve that content with
          | nginx or any other web server without having to do the
          | expensive thing of SSR via Node.js. Not to mention that being
          | able to do SSR with modern JavaScript frameworks is not
          | trivial, and requires engineering time.
        
       | ivolimmen wrote:
        | So a modern CHM (Microsoft Compiled HTML Help file)
        
       | nox101 wrote:
        | I'm curious about this vs a .har file.
        | 
        | In Chrome DevTools, network tab, the last icon looks like an
        | arrow pointing into a dish (Export HAR file).
        | 
        | I guess a .har file has a ton more data, though I have used it to
        | extract data from sites that either intentionally or
        | unintentionally make it hard to get data. For example, when
        | signing up for an apartment, the apartment management site used
        | pdf.js and provided no way to save the PDF. So I saved the .har
        | file and extracted the PDF.
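        | 
        | Pulling something back out of a .har is mostly just reading
        | JSON; roughly (field names follow the HAR format, the file name
        | and mime filter here are made up):
        | 
        |   import fs from "node:fs/promises";
        | 
        |   const text = await fs.readFile("session.har", "utf8");
        |   const har = JSON.parse(text);
        |   for (const entry of har.log.entries) {
        |     const { content } = entry.response;
        |     if (content?.mimeType !== "application/pdf") continue;
        |     if (!content.text) continue;
        |     // bodies are usually base64-encoded in the HAR
        |     const enc = content.encoding === "base64" ? "base64" : "utf8";
        |     await fs.writeFile("out.pdf", Buffer.from(content.text, enc));
        |     break;
        |   }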
        
       ___________________________________________________________________
       (page generated 2024-06-10 23:01 UTC)