[HN Gopher] ArchiveBox/ArchiveBox: open-source self-hosted web a...
       ___________________________________________________________________
        
       ArchiveBox/ArchiveBox: open-source self-hosted web archiving
        
       Author : rcarmo
       Score  : 181 points
       Date   : 2021-07-15 09:07 UTC (1 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | encryptluks2 wrote:
       | What is the advantage of this over something like Kiwix, or just
       | using Playwright CLI? It seems useful but a bit unnecessary if
       | you're just using it to create Archive.org links.
        
         | rcarmo wrote:
         | This does _a lot more_ than just creating archive.org links. It
         | saves the entire page contents (in multiple formats, including
         | nicely searchable PDF and some embedded media) locally.
        
         | soneil wrote:
         | I use it like permanent bookmarks. I can go back to it and
         | trust it'll still be there. And this isn't just "things
         | disappear eventually" - for a specific example, I was working
         | on something rather last-minute recently, and wanted to refer
         | to a vendor's whitepaper - and their entire site was "down for
         | maintenance" all weekend. So I went onto my archive and I still
         | had a copy from my first pass over the topic. It's a free
         | resource, I can't blame them and I can't complain - but if I
         | can make sure it doesn't impact me, even better.
         | 
         | I know other people would still have that tab open from 3 weeks
         | ago, but I just don't work like that.
         | 
         | I'm not going to complain about wayback/archive.org at all, but
         | the nature of the beast is that there are certain requests they
         | have to obey - and with my own offline, non-exposed equivalent,
         | I don't (well, I do, but I simply don't receive them).
        
           | sleavey wrote:
           | > I use it like permanent bookmarks.
           | 
           | I create browser bookmarks regularly but I wouldn't be
           | bothered to SSH into my server to also tell it to grab a copy
           | of the URL. Automating this with a browser plugin would be
           | cool.
        
             | soneil wrote:
             | There is a web UI, so I have a JS "bookmarklet" in my
             | toolbar. If you go to the web UI and Add a url, right at
             | the bottom of the page there's a link you can bookmark to
             | do the same. So that's my workflow - click Archive (or cmd-
             | alt-1), then hit enter. Done.
        
               | sleavey wrote:
               | Aha, that's great! I might give this a go this weekend
               | then.
        
               | nikisweeting wrote:
               | There's also a real browser extension in the works by one
               | of our users:
               | https://github.com/ArchiveBox/ArchiveBox/issues/577#issuecom...
        
         | braincoke wrote:
         | ArchiveBox is also an awesome tool for creating copies of a
         | website, whether you want to demonstrate a phishing attack or
         | do a POC to integrate your product.
        
       | rambambram wrote:
       | I really need to try this out soon. I keep bumping into it
       | online. What do people on HN use it for?
       | 
       | What I use right now "to collect, save, and view sites you want to
       | preserve offline" is a Firefox plugin called
       | WebScrapBook. Click-click-done, and I have a local searchable (!)
       | copy of a webpage exactly as it looked in the browser. With
       | styles and all, in one file. WebScrapBook is pretty highly
       | configurable.
       | 
       | In the future I would like to have a solution that doesn't
       | require some Firefox plugin.
        
         | thedanbob wrote:
         | I've used it to save a lot of pages related to ham radio. I
         | have several 30-40 year old radios and I'm afraid one day
         | information about them will just drop off the internet.
        
           | rambambram wrote:
           | Cool! How is the saving experience? Is it quick? And looking
           | up? Or is it more just 'saving for later use'?
        
             | thedanbob wrote:
             | Fairly quick, depending of course on the weight of the
             | page(s) and how many archive methods you enable (I have
             | several turned off because they are redundant for my
             | purposes). The "adding to archive" interstitial still tends
             | to time out which is a little annoying, but the actual
             | archive process is backgrounded so it doesn't matter.
             | 
             | The recommended install includes a search engine that works
             | well, aside from a few false positives due to it being a fuzzy
             | search. I don't have much in it yet so I can't say what the
             | performance is like once you reach e.g. thousands of pages,
             | but I imagine it would still perform well except maybe for
             | mass operations like rebuilding the entire index.
        
       | infogulch wrote:
       | I think it would be neat to try integrating ArchiveBox with the
       | enhanced history / browsing context extension Promnesia [1].
       | 
       | [1]: https://github.com/karlicoss/promnesia
        
       | blastro wrote:
       | i have been archiving cool or interesting sites with this tool
       | for about a year now. fantastic. can't say enough good things
       | about it.
        
       | NortySpock wrote:
       | How does this compare with, say, Wallabag?
       | https://github.com/wallabag
       | 
       | Looks like ArchiveBox has more export options? EDIT: looks like
       | ArchiveBox is focused on continuous change tracking rather than
       | just snapshots like Wallabag.
        
         | rcarmo wrote:
         | Actually, no. You can take more snapshots, but that's just an
         | added feature.
        
       | programmarchy wrote:
       | Wow, it can even extract video! This looks phenomenal, and a
       | great excuse to stock up on more disk space.
       | 
       | Their roadmap is also very interesting: "v2.0 Federated or
       | distributed archiving + paid hosted service offering"
       | 
       | https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap#v20-fe...
        
       | nikisweeting wrote:
       | Thanks for posting this @rcarmo!
       | 
       | One neat tiny implementation detail of ArchiveBox that I just
       | highlighted on HN today is our use of asymptotic progress bars
       | when we don't know how long archiving a page is going to take:
       | https://news.ycombinator.com/item?id=27860022
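
       [Editor's sketch] The idea of an asymptotic progress bar mentioned
       above can be illustrated as follows. This is a minimal sketch and not
       ArchiveBox's actual code: the exponential curve, the 30-second time
       scale, and the names asymptotic_progress and render_bar are all
       assumptions chosen for illustration. The bar moves quickly at first,
       then slows down, and never reaches 100% until the task really finishes.

```python
import math


def asymptotic_progress(elapsed_s: float, time_scale_s: float = 30.0) -> float:
    """Return a fraction in [0, 1) that approaches 1 but never reaches it.

    time_scale_s is a hypothetical tuning knob: after that many seconds
    the bar shows about 63% (1 - 1/e).
    """
    if elapsed_s <= 0:
        return 0.0
    return 1.0 - math.exp(-elapsed_s / time_scale_s)


def render_bar(fraction: float, width: int = 40) -> str:
    """Draw a fixed-width text bar for a fraction, clamped to [0, 1]."""
    fraction = max(0.0, min(1.0, fraction))
    filled = int(fraction * width)
    return "[" + "#" * filled + " " * (width - filled) + "]"


if __name__ == "__main__":
    # Show how the bar slows down as elapsed time grows.
    for seconds in (0, 5, 15, 30, 60, 120):
        print(f"{seconds:>4}s {render_bar(asymptotic_progress(seconds))}")
```

       When the task completes, the caller would jump the bar straight to
       100%; the curve above only covers the "still waiting" phase.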
        
       | rcarmo wrote:
       | Just FYI, I have this set up as a Docker container on my Synology
       | and it is now patiently crawling through my (imported) 2000+
       | Pocket URLs, to which I'm adding a lot of other stuff scattered
       | across other "clipping" tools (like OneNote).
       | 
       | Key benefit for me is having actual local files. The resulting
       | PDFs are searchable on their own, so I can sync those back to my
       | Mac for reference (and Spotlight indexing). But the HTML
       | snapshots are also pretty decent.
       | 
       | One thing I'll be looking into is automatic tagging (since it's a
       | Django app there are plenty of likely ways to inject that info).
        
         | res0nat0r wrote:
         | I just got my first Synology literally two days ago, a
         | DS3617xsII. Looking forward to playing with it, especially the
         | virtualization / Docker features. How do you like it?
        
           | rcarmo wrote:
           | Very much so, this is my second and I've had it for a year
           | now: https://taoofmac.com/space/blog/2020/04/04/2310
        
             | res0nat0r wrote:
             | Cool post. Sounds like it has a lot of cool features,
             | looking forward to messing with it over the weekend.
        
       ___________________________________________________________________
       (page generated 2021-07-16 23:03 UTC)