[HN Gopher] ArchiveBox/ArchiveBox: open-source self-hosted web a...
___________________________________________________________________
ArchiveBox/ArchiveBox: open-source self-hosted web archiving
Author : rcarmo
Score : 181 points
Date : 2021-07-15 09:07 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| encryptluks2 wrote:
| What is the advantage of this over something like Kiwix, or just
| using Playwright CLI? It seems useful but a bit unnecessary if
| you're just using it to create Archive.org links.
| rcarmo wrote:
| This does _a lot more_ than just creating archive.org links. It
| saves the entire page contents (in multiple formats, including
| nicely searchable PDF and some embedded media) locally.
| soneil wrote:
| I use it like permanent bookmarks. I can go back to it and
| trust it'll still be there. And this isn't just "things
| disappear eventually" - for a specific example, I was working
| on something rather last-minute recently, and wanted to refer
| to a vendor's whitepaper - and their entire site was "down for
| maintenance" all weekend. So I went onto my archive and I still
| had a copy from my first pass over the topic. It's a free
| resource, I can't blame them and I can't complain - but if I
| can make sure it doesn't impact me, even better.
|
| I know other people would still have that tab open from 3 weeks
| ago, but I just don't work like that.
|
| I'm not going to complain about wayback/archive.org at all, but
| the nature of the beast is that there are certain requests they
| have to obey - and with my own offline, non-exposed equivalent,
| I don't (well, I do, but I simply don't receive them).
| sleavey wrote:
| > I use it like permanent bookmarks.
|
| I create browser bookmarks regularly but I wouldn't be
| bothered to SSH into my server to also tell it to grab a copy
| of the URL. Automating this with a browser plugin would be
| cool.
| soneil wrote:
| There is a web UI, so I have a JS "bookmarklet" in my
| toolbar. If you go to the web UI and add a URL, right at
| the bottom of the page there's a link you can bookmark to
| do the same. So that's my workflow - click Archive (or
| cmd-alt-1), then hit enter. Done.
| sleavey wrote:
| Aha, that's great! I might give this a go this weekend
| then.
| nikisweeting wrote:
| There's also a real browser extension in the works by one
| of our users:
| https://github.com/ArchiveBox/ArchiveBox/issues/577#issuecom...
| braincoke wrote:
| ArchiveBox is also an awesome tool for creating copies of a
| website, whether you want to demonstrate a phishing attack or
| do a POC to integrate your product.
| rambambram wrote:
| I really need to try this out soon. I keep bumping into it
| online. What do people on HN use it for?
|
| What I do right now "to collect, save, and view sites you want to
| preserve offline" is use a Firefox plugin called
| WebScrapBook. Click-click-done, and I have a local searchable (!)
| copy of a webpage exactly as it looked in the browser. With
| styles and all, in one file. WebScrapBook is pretty highly
| configurable.
|
| In the future I would like to have a solution that doesn't
| require some Firefox plugin.
| thedanbob wrote:
| I've used it to save a lot of pages related to ham radio. I
| have several 30-40 year old radios and I'm afraid one day
| information about them will just drop off the internet.
| rambambram wrote:
| Cool! How is the saving experience? Is it quick? And looking
| up? Or is it more just 'saving for later use'?
| thedanbob wrote:
| Fairly quick, depending of course on the weight of the
| page(s) and how many archive methods you enable (I have
| several turned off because they are redundant for my
| purposes). The "adding to archive" interstitial still tends
| to time out which is a little annoying, but the actual
| archive process is backgrounded so it doesn't matter.
|
| The recommended install includes a search engine that works
| well, aside from a few false positives due to it being a fuzzy
| search. I don't have much in it yet so I can't say what the
| performance is like once you reach e.g. thousands of pages,
| but I imagine it would still perform well except maybe for
| mass operations like rebuilding the entire index.
| infogulch wrote:
| I think it would be neat to try integrating ArchiveBox with the
| enhanced history / browsing context extension Promnesia [1].
|
| [1]: https://github.com/karlicoss/promnesia
| blastro wrote:
| i have been archiving cool or interesting sites with this tool
| for about a year now. fantastic. can't say enough good things
| about it.
| NortySpock wrote:
| How does this compare with, say, Wallabag?
| https://github.com/wallabag
|
| Looks like ArchiveBox has more export options? EDIT: looks like
| ArchiveBox is focused on continuous change tracking rather than
| just snapshots like Wallabag.
| rcarmo wrote:
| Actually, no. You can take more snapshots, but that's just an
| added feature.
| programmarchy wrote:
| Wow, it can even extract video! This looks phenomenal, and a
| great excuse to stock up on more disk space.
|
| Their roadmap is also very interesting: "v2.0 Federated or
| distributed archiving + paid hosted service offering"
|
| https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap#v20-fe...
| nikisweeting wrote:
| Thanks for posting this @rcarmo!
|
| One neat tiny implementation detail of ArchiveBox that I just
| highlighted on HN today is our use of asymptotic progress bars
| when we don't know how long archiving a page is going to take:
| https://news.ycombinator.com/item?id=27860022
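
For readers unfamiliar with the idea, here is a minimal sketch of an
asymptotic progress bar: the bar keeps advancing but slows down as it
approaches 100%, so it still looks sensible when the total duration is
unknown. This is not ArchiveBox's actual code. The curve
(elapsed / (elapsed + expected)) and the names asymptotic_progress,
render_bar and expected are assumptions chosen for illustration.

```python
def asymptotic_progress(elapsed: float, expected: float) -> float:
    """Map elapsed time to a fraction that approaches 1 without a fixed end.

    `expected` is a hypothetical rough guess of the task duration, in the
    same unit as `elapsed`; the bar shows 50% once that guess has passed.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    if elapsed <= 0:
        return 0.0
    # Hyperbolic curve: fast at first, ever slower as it nears 1.
    return elapsed / (elapsed + expected)


def render_bar(fraction: float, width: int = 40) -> str:
    """Render a fraction in [0, 1] as a fixed-width text bar."""
    fraction = min(max(fraction, 0.0), 1.0)
    filled = int(fraction * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


if __name__ == "__main__":
    # Simulated timestamps in seconds, with a guessed duration of 10 s.
    for t in (0, 5, 10, 30, 90):
        print(f"{t:>3}s {render_bar(asymptotic_progress(t, 10.0))}")
```

When the task actually finishes, the caller would jump the bar straight
to 100%. The curve only covers the waiting period.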
| rcarmo wrote:
| Just FYI, I have this set up as a Docker container on my Synology
| and it is now patiently crawling through my (imported) 2000+
| Pocket URLs, to which I'm adding a lot of other stuff scattered
| across other "clipping" tools (like OneNote).
|
| Key benefit for me is having actual local files. The resulting
| PDFs are searchable on their own, so I can sync those back to my
| Mac for reference (and Spotlight indexing). But the HTML
| snapshots are also pretty decent.
|
| One thing I'll be looking into is automatic tagging (since it's a
| Django app there are plenty of likely ways to inject that info).
| res0nat0r wrote:
| I just got my first Synology literally two days ago, a
| DS3617xsII. Looking forward to playing with it, especially the
| virtualization / Docker features. How do you like it?
| rcarmo wrote:
| Very much so, this is my second and I've had it for a year
| now: https://taoofmac.com/space/blog/2020/04/04/2310
| res0nat0r wrote:
| Cool post. Sounds like it has a lot of cool features,
| looking forward to messing with it over the weekend.
___________________________________________________________________
(page generated 2021-07-16 23:03 UTC)