[HN Gopher] Show HN: Crawl a modern website to a zip, serve the ...
___________________________________________________________________
Show HN: Crawl a modern website to a zip, serve the website from
the zip
Author : unlog
Score : 87 points
Date : 2024-06-10 11:47 UTC (11 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| unlog wrote:
| I'm a big fan of modern JavaScript frameworks, but I don't fancy
| SSR, so have been experimenting with crawling myself for
| uploading to hosts without having to do SSR. This is the result
| ryanwaldorf wrote:
| What's the benefit of this approach?
| unlog wrote:
| That the page HTML is indexable by search engines without
| having to render it on the server: you can just unzip the
| crawl into a directory served by nginx. You may also use it
| for archiving purposes, or for keeping backups.
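|
| For illustration, a minimal nginx server block for that
| unzip-and-serve setup could look like this (the domain and the
| extraction directory are placeholders, not part of the
| project):
|
|     server {
|       listen 80;
|       server_name example.com;   # hypothetical domain
|       root /var/www/site;        # directory the zip was unzipped to
|       index index.html;
|       # optional: fall back to index.html for paths not in the crawl
|       try_files $uri $uri/ /index.html;
|     }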
| brudgers wrote:
| One _possible_ advantage _I_ see is that it creates a 1:1
| correspondence between a website and a file.
|
| If what I care about is the website (and that's usually going
| to be the case), then there's a single familiar box containing
| all the messy details. I don't have to see all the files I want
| to ignore.
|
| That might not be a benefit for you, and since I haven't used
| it, it is only a theoretical benefit in an unlikely future for
| me.
|
| But just from the title of the post, I had a very clear
| picture of the mechanism, and it was not obvious why I would
| want to _start_ with a different mechanism (barring ordinary
| issues with open source projects).
|
| But that's me and your mileage may vary.
| earleybird wrote:
| Understood that these are early days; are you considering a
| licence to release it under?
| unlog wrote:
| Sure, I forgot about that detail, what license do you suggest?
| yodon wrote:
| MIT and BSD seem to be by far the most common these days (I
| generally do MIT personally)
| unlog wrote:
| added
| meiraleal wrote:
| AGPL
| yodon wrote:
| +1 can you add a license file with an MIT or BSD or whatever
| your preference is? (It's very cool. I'd love to help with this
| project, I'm guessing others would as well)
| jayemar wrote:
| Is the output similar to a web archive file (warc)?
| unlog wrote:
| That's something I haven't explored, sounds interesting. Right
| now, the zip file contains a mirror of the files found on the
| website when loaded in a browser. I ended up with a zip file by
| luck, as mirroring to the file system causes predictable
| problems with file/folder names.
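|
| As a rough illustration of the serve-from-the-zip half of the
| title, here is a Node/TypeScript sketch using the adm-zip
| package (the archive name, port, and type map are assumptions;
| the project's actual server may differ):
|
|     import http from 'node:http';
|     import AdmZip from 'adm-zip';        // zip reader from npm
|
|     const zip = new AdmZip('site.zip');  // hypothetical crawl output
|
|     const types: Record<string, string> = {
|       '.html': 'text/html',
|       '.js': 'text/javascript',
|       '.css': 'text/css',
|       '.png': 'image/png',
|       '.svg': 'image/svg+xml',
|     };
|
|     http.createServer((req, res) => {
|       // map "/" and "/docs/" style URLs to index.html entries
|       const raw = (req.url ?? '/').split('?')[0];
|       let name = decodeURIComponent(raw).slice(1);
|       if (name === '' || name.endsWith('/')) name += 'index.html';
|
|       const entry = zip.getEntry(name);
|       if (!entry) { res.writeHead(404); res.end('not found'); return; }
|
|       const ext = name.slice(name.lastIndexOf('.'));
|       res.writeHead(200, {
|         'Content-Type': types[ext] ?? 'application/octet-stream',
|       });
|       res.end(entry.getData());          // decompressed file contents
|     }).listen(8080);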
| toomuchtodo wrote:
| https://news.ycombinator.com/item?id=40628958
|
| https://github.com/internetarchive/warctools
| jll29 wrote:
| Microsoft Internet Explorer (no, I'm not using it personally)
| had a file format called *.mht that could save an HTML page
| together with all the files referenced from it, like inline
| images. I believe you could not store more than one page in one
| *.mht file, though, so your work could be seen as an extension.
|
| Although UNIX philosophy posits that it's good to have many
| small files, I like your idea for its contribution to reducing
| clutter (imagine running 'tree' in both scenarios) and also to
| avoiding running out of inodes in some file systems (maybe less
| of a problem nowadays in general, not sure as I haven't
| generated millions of tiny files recently).
| jdougan wrote:
| .mht is alive and well. It is a MIME wrapper around the files
| and is generated by the "Webpage as single file" save option in
| Chrome, Opera, and Edge, defaulting to an extension of .mhtml.
|
| When I last looked Firefox didn't support it natively but it
| was a requested feature.
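|
| For what it's worth, headless Chrome can also produce an MHTML
| snapshot programmatically through the DevTools protocol; a
| rough Node/TypeScript sketch with puppeteer (the URL and output
| file are placeholders):
|
|     import puppeteer from 'puppeteer';
|     import { writeFile } from 'node:fs/promises';
|
|     const browser = await puppeteer.launch();   // headless Chrome
|     const page = await browser.newPage();
|     await page.goto('https://example.com', { waitUntil: 'networkidle0' });
|
|     // MHTML capture is exposed via the DevTools protocol
|     const cdp = await page.target().createCDPSession();
|     const { data } = await cdp.send('Page.captureSnapshot', {
|       format: 'mhtml',
|     });
|
|     await writeFile('page.mhtml', data);
|     await browser.close();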
| rrr_oh_man wrote:
| _> When I last looked Firefox didn't support it natively but
| it was a requested feature._
|
| That sounds familiar, unfortunately
| unlog wrote:
| Yes! You know, I was considering this over the previous couple
| of days and was looking around for how to construct an `mhtml`
| file for serving all the files at the same time. Unrelated to
| this project, I had a client who wanted to keep an offline
| version of one of my projects.
|
| > Although UNIX philosophy posits that it's good to have many
| small files, I like your idea for its contribution to reducing
| clutter (imagine running 'tree' in both scenarios) and also to
| avoiding running out of inodes in some file systems (maybe less
| of a problem nowadays in general, not sure as I haven't
| generated millions of tiny files recently).
|
| Pretty rare for any website to have many files, as they
| optimize to have as few files as possible (fewer network
| requests, which could be slower than just shipping one big
| file). I have crawled the React docs as a test, and it's a zip
| file of 147 MB with 3,803 files (including external resources).
|
| https://docs.solidjs.com/ is 12 MB (including external
| resources) with 646 files.
| kitd wrote:
| Nice work!
|
| Obligatory mention for RedBean, the server that you can package
| along with all assets (incl db, scripting and TLS support) into a
| single multi-platform binary.
|
| https://redbean.dev/
| tamimio wrote:
| How is it different from HTTrack? And what about media
| extensions: which ones are supported and which aren't?
| Sometimes when I download sites with HTTrack, some files just
| get ignored because by default it only looks for default types,
| and you have to add them manually.
| unlog wrote:
| Big fan of HTTrack! It reminds me of the old days and makes me
| sad about the current state of the web.
|
| I am not sure whether HTTrack has progressed beyond fetching
| resources, it's been a long time since I last used it, but what
| my project does is spin up a real web browser (Chrome in
| headless mode, which means it's hidden) and let the JavaScript
| on that website execute, which means it will display/generate
| some fancy HTML that you can then save as-is into an
| index.html. It saves all kinds of files; it doesn't care about
| the extension or MIME types, it tries to save them all.
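|
| The general shape of that mechanism could be sketched with
| puppeteer and adm-zip roughly like this (a simplified
| illustration, not the project's actual code; the URL and
| archive name are placeholders, and external resources are
| skipped here for brevity even though the real tool saves them):
|
|     import puppeteer from 'puppeteer';
|     import AdmZip from 'adm-zip';
|
|     const start = new URL('https://example.com/');
|     const zip = new AdmZip();
|
|     const browser = await puppeteer.launch();   // headless Chrome
|     const page = await browser.newPage();
|
|     // store every same-origin response, whatever its extension or type
|     page.on('response', async (res) => {
|       try {
|         const url = new URL(res.url());
|         if (url.origin !== start.origin) return; // external: skipped here
|         if (url.pathname === '/') return;        // handled below, post-JS
|         zip.addFile(url.pathname.slice(1), await res.buffer());
|       } catch {
|         // redirects and opaque responses have no body
|       }
|     });
|
|     await page.goto(start.href, { waitUntil: 'networkidle0' });
|
|     // the HTML *after* client-side JavaScript has run
|     zip.addFile('index.html', Buffer.from(await page.content()));
|
|     zip.writeZip('site.zip');
|     await browser.close();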
| tamimio wrote:
| > It saves all kinds of files; it doesn't care about the
| extension or MIME types, it tries to save them all.
|
| That's awesome to know, I will give it a try. One website I
| remember trying to download had all sorts of animations with a
| .riv extension and didn't work well with HTTrack; I will try it
| with this soon, thanks for sharing it!
| renegat0x0 wrote:
| My 5 cents:
|
| - Status codes 200-299 are all OK.
|
| - Status codes 300-399 are redirects, and can eventually be OK
| too.
|
| - In my experience 403 occurs quite often, where it is not an
| error but a hint that your user agent is not acceptable.
|
| - robots.txt should be scanned to check whether any resource is
| prohibited or whether there are crawl-rate requirements. It is
| always better to be _nice_. I plan to add something like that;
| it is missing from my project as well.
|
| - It would be interesting to generate a hash from the app and
| update only if the hash is different?
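|
| A naive Node/TypeScript sketch of that hash idea, assuming the
| crawl output is a single archive named site.zip and the last
| deployed hash is kept next to it (all names are placeholders):
|
|     import { createHash } from 'node:crypto';
|     import { readFile, writeFile } from 'node:fs/promises';
|
|     // hash the freshly crawled archive
|     const digest = createHash('sha256')
|       .update(await readFile('site.zip'))
|       .digest('hex');
|
|     // compare with the hash recorded at the previous deploy
|     const previous = await readFile('site.zip.sha256', 'utf8')
|       .catch(() => '');
|
|     if (digest === previous.trim()) {
|       console.log('unchanged, skipping update');
|     } else {
|       console.log('content changed, deploying new archive');
|       await writeFile('site.zip.sha256', digest);
|     }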
| unlog wrote:
| Status codes: I am displaying the list because, on a mostly
| JavaScript-driven application, you don't want codes other than
| 200 (besides media).
|
| I thought about robots.txt, but as this is software that you
| are supposed to run against your own website, I didn't consider
| it worthwhile. You have a point on crawl-rate requirements and
| prohibited resources (but it's not as if skipping over them
| adds any security).
|
| I haven't put much time/effort into an update step. Currently,
| it resumes via checkpoints if the process exited (it saves the
| current state every 250 URLs; if any URL is still missing it
| can continue, otherwise it is done).
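|
| In case it helps picture it, a minimal checkpoint/resume loop
| in Node/TypeScript might look like this (the file name,
| checkpoint shape, and seed URL are assumptions, not the
| project's actual format):
|
|     import { readFile, writeFile } from 'node:fs/promises';
|
|     type Checkpoint = { done: string[]; pending: string[] };
|
|     // resume from a previous run if a checkpoint exists,
|     // otherwise start a fresh crawl from the seed URL
|     const state: Checkpoint = await readFile('checkpoint.json', 'utf8')
|       .then((text) => JSON.parse(text) as Checkpoint)
|       .catch(() => ({ done: [], pending: ['https://example.com/'] }));
|
|     while (state.pending.length > 0) {
|       const url = state.pending.shift()!;
|       // ... fetch/render `url` here and push newly discovered
|       //     URLs onto state.pending ...
|       state.done.push(url);
|
|       // persist progress every 250 URLs so an interrupted run
|       // can continue where it left off
|       if (state.done.length % 250 === 0) {
|         await writeFile('checkpoint.json', JSON.stringify(state));
|       }
|     }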
|
| Thanks, btw what's your project!? Share!
| meiraleal wrote:
| Seems like a very useful tool to impersonate websites. Useful to
| scammers. Why would someone crawl their own website?
| kej wrote:
| Scammers don't need this to copy an existing website, and I
| could see plenty of legitimate uses. Maybe you're redoing the
| website but want to keep the previous site around somewhere, or
| you want an easy way to archive a site for future reference.
| Maybe you're tired of paying for some hosted CMS but you want
| to keep the content.
| meiraleal wrote:
| All the scenarios you described can be achieved by having
| access to the source code, assuming you own it.
| kej wrote:
| Lots of things are possible with access to source code that
| are still easier when someone writes a tool for that
| scenario.
| meiraleal wrote:
| Crawling a build you already have isn't one of them
| p4bl0 wrote:
| The website in question may be a dynamic website (e.g.,
| WordPress, MediaWiki, or whatever other CMS or custom web
| app) and you either want a snapshot of it for backup, or
| you run it locally and want a static copy to host it
| elsewhere on something that only supports static files.
| unlog wrote:
| > Why would someone crawl their own website?
|
| My main use case is that Google cannot index my docs site
| https://pota.quack.uy/ properly. Here
| https://www.google.com/search?q=site%3Apota.quack.uy you will
| see that some titles/descriptions don't match what the content
| of the page is about. As the full site is rendered client side
| via JavaScript, I can just crawl it myself and save the HTML
| output to actual files. Then I can serve that content with
| nginx or any other web server without having to do the
| expensive thing of SSR via Node.js. Not to mention that being
| able to do SSR with modern JavaScript frameworks is not
| trivial and requires engineering time.
| ivolimmen wrote:
| So a modern CHM (Microsoft Compiled HTML Help file)
| nox101 wrote:
| I'm curious about this vs a .har file
|
| In Chrome Devtools, network tab, last icon that looks like an
| arrow pointing into a dish (Export har file)
|
| I guess a .har file has a ton more data, though. I have used it
| to extract data from sites that either intentionally or
| unintentionally make it hard to get data. For example, when
| signing up for an apartment, the apartment management site used
| pdf.js and provided no way to save the PDF. So I saved the .har
| file and extracted the PDF.
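|
| Pulling a resource back out of a HAR export is mostly JSON
| digging; a rough Node/TypeScript sketch (file names are
| placeholders):
|
|     import { readFile, writeFile } from 'node:fs/promises';
|
|     const har = JSON.parse(await readFile('capture.har', 'utf8'));
|
|     for (const entry of har.log.entries) {
|       const { mimeType, text, encoding } = entry.response.content ?? {};
|       if (mimeType !== 'application/pdf' || !text) continue;
|
|       // HAR stores binary bodies as base64-encoded text
|       const body = encoding === 'base64'
|         ? Buffer.from(text, 'base64')
|         : Buffer.from(text);
|
|       await writeFile('extracted.pdf', body);
|       console.log('saved PDF from', entry.request.url);
|     }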
___________________________________________________________________
(page generated 2024-06-10 23:01 UTC)