[HN Gopher] ArchiveBox is evolving: the future of self-hosted in...
       ___________________________________________________________________
        
       ArchiveBox is evolving: the future of self-hosted internet archives
        
       We've been pushing really hard over the last 6mo to develop this
       release. I'd love to hear feedback from people who've worked on big
       plugin systems in the past, or anyone who's tried our betas!
        
       Author : nikisweeting
       Score  : 404 points
       Date   : 2024-10-16 16:18 UTC (6 hours ago)
        
 (HTM) web link (docs.sweeting.me)
 (TXT) w3m dump (docs.sweeting.me)
        
       | grinch5751 wrote:
       | This looks like a really wonderful set of developments. Already
       | making plans to use an old laptop of mine as an archivebox
       | machine.
        
       | the_gorilla wrote:
       | I don't know how anyone manages to use archivebox. I've tried it
       | twice in the last 3 years and its site compatibility is bad, it
       | quietly leaks everything you archive to archive.org by default,
       | and whenever it fails on a download it stops archiving anything
       | even after deleting and resubmitting all the jobs.
       | 
       | I'm sure it works for _some_ people, but not me.
        
         | nikisweeting wrote:
         | These are legitimate gripes that have plagued specific past
         | releases, I hear your frustration. Please keep in mind this was
         | a solo effort of a single developer, only worked on in my spare
         | time over the last 7 years (up until very recently).
         | 
         | The new v0.8 adds a BG queue specifically to deal with the
         | issue of stalling when some sites fail. There was a system to
         | do this in the past, but it was imperfect and mostly optimized
         | for the docker setup where a scheduler is running `archivebox
         | update` every few hours to retry failed URLs.
         | 
         | Site compatibility is much improved with the new BETA, but it's a
         | perpetual cat and mouse game to fix specific sites, which is
         | why we think the new plugin system is the way forward. It's
         | just not sustainable for a single company (really just me right
         | now) to maintain hundreds of workarounds for each individual
         | site. I'm also discussing with the Webrecorder and Archive.org
         | teams how we can share these site-specific workarounds as
         | cross-compatible plugins (aka "behaviors") between our various
         | software.
         | 
         | > it quietly leaks everything you archive to archive.org by
         | default
         | 
         | It's prominently mentioned many times (at least 4) on our
         | homepage that this is the default, and archiving public-only
         | sites (which are already fair game for Archive.org) is a
         | default for good reason. Archiving private content requires
         | several important changes and security considerations. More
         | context: https://news.ycombinator.com/item?id=26866689
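
The background queue described in the reply above can be illustrated with a minimal sketch. This is not ArchiveBox's actual implementation: `process_queue`, `fake_archive`, and the example URLs are hypothetical names chosen for illustration. It only shows the general idea that a failing URL is pushed to the back of the queue for a later retry instead of stalling everything behind it.

```python
from collections import deque
from typing import Callable, Iterable


def process_queue(
    urls: Iterable[str],
    archive: Callable[[str], None],
    max_attempts: int = 3,
) -> tuple[list[str], list[str]]:
    """Archive every URL; a failing URL is re-queued instead of blocking the rest."""
    queue = deque((url, 1) for url in urls)
    done: list[str] = []
    failed: list[str] = []
    while queue:
        url, attempt = queue.popleft()
        try:
            archive(url)
            done.append(url)
        except Exception:
            if attempt < max_attempts:
                # Retry later, after the URLs still waiting in the queue.
                queue.append((url, attempt + 1))
            else:
                failed.append(url)
    return done, failed


def fake_archive(url: str) -> None:
    """Hypothetical archiver that always fails on one site."""
    if "bad" in url:
        raise RuntimeError("download failed")


done, failed = process_queue(
    ["https://a.example", "https://bad.example", "https://b.example"],
    fake_archive,
)
print(done, failed)
```

In this sketch the two working sites are archived even though the middle one keeps failing, and the failing URL ends up in `failed` after three attempts.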
        
           | freedomben wrote:
           | Yeah, I'm not sure whether archive.org should be defaulted to
           | on or off (I see both sides of that one), but its existence
           | is definitely surfaced.
           | 
           | I love Archive Box btw, thank you for your effort! It's
           | filling a very important need.
        
           | the_gorilla wrote:
           | I can accept the other issues, but archivebox needs to be
           | private and secure by default.
           | 
           | Sending everything to archive.org is a bad default value and it
           | erodes a certain level of trust in the project. Requiring
           | "several important changes and security considerations" just
           | makes it a non-starter. The default settings should be "safe"
           | for the default user, because as you mentioned in that post,
           | 90% of users are never going to change them. Users should be
           | able to run it locally and archive data without worrying
           | about security issues, unless you only want experts to be
           | able to use your software.
           | 
           | There's also a contradiction between your statement and your
           | blogpost: someone saving their photos isn't going to _want_ to
           | worry about whether they configured your tool correctly or
           | are leaking all the group logs or grandma's photos.
           | 
           | >It's prominently mentioned many times (at least 4) on our
           | homepage that this is the default, and archiving public-only
           | sites (which are already fair game for Archive.org) is a
           | default for good reason. Archiving private content requires
           | several important changes and security considerations. More
           | context
           | 
           | > Who cares about saving stuff?
           | 
           | > All of us have content that we care about, that we want to
           | see preserved, but privately:
           | 
           | > families might want to preserve their photo albums off
           | Facebook, Flickr, Instagram
           | 
           | > individuals might want to save their bookmarks, social
           | feeds, or chats from Signal/Discord
           | 
           | > companies might want to save their internal documents, old
           | sites, competitor analyses, etc.
           | 
           | I want the project to do well but it really needs to be
           | secure by default.
        
             | nikisweeting wrote:
             | > The default settings should be "safe" for the default
             | user,
             | 
             | I 100% agree, but because private archiving is doable but
             | NOT 100% safe yet I can't make that mode the default. The
             | difficult reality currently is that archiving anything non-
             | public is not simple to make safe.
             | 
             | Every capture will contain reflected session cookies,
             | usernames, and PII, and other sensitive content. People
             | don't understand that this means if they share a snapshot
             | of one page they're potentially leaking their login
             | credentials for an entire site.
             | 
             | It is possible to do safely, and we provide ways to achieve
             | that that I'm constantly working on improving, but until
             | it's _easy_ and straightforward and doesn't require any
             | user education on security implications, I can't make it the
             | default.
             | 
             | The goal is to get it to the point where it CAN be the
             | default, but I'm still at least 6mo away from that point.
             | Check out the archivebox/sessions dir in the source code
             | for a look at the development happening here.
             | 
             | Until then, it requires some user education and setting up
             | a dedicated chrome profile + cookies + tweaking config to
             | do. (as an intentional barrier to entry for private
             | archiving)
        
               | bigiain wrote:
               | That's a really good response, thanks.
               | 
               | I've been very impressed by all of your responses in
               | here, but that one in particular shows empathy,
               | compassion, and a deep deep subject matter expertise.
        
             | hobs wrote:
             | As a custom tool built to archive stuff for archive.org,
             | why would you expect that it can also do a completely
             | opposite task, saving information privately?
             | 
             | I can see why you would want such a tool, but it seems like
             | a direct divergence from the core goal of the existing
             | codebase.
        
               | the_gorilla wrote:
               | [flagged]
        
               | dang wrote:
               | We've banned this account for breaking the site
               | guidelines. Please don't create accounts to break HN's
               | rules with.
               | 
               | https://news.ycombinator.com/newsguidelines.html
        
             | Apocryphon wrote:
             | Perhaps this data is "private" as in "personal property"
             | and not "private" as in "confidential."
        
               | nikisweeting wrote:
               | It's intended for both but it currently requires extra
               | setup to do "confidential" because there are security
               | risks.
        
       | toomuchtodo wrote:
       | https://github.com/ArchiveTeam/grab-site might be helpful. I'm a
       | fan of the ability to create WARC archives from a target, upload
       | the WARC files to object storage (whether that is IA, S3,
       | Backblaze B2, etc), and then keep them in cold storage or serve
       | them up via HTTPS or a torrent (mutable, preferred). The Internet
       | Archive serves a torrent file for every item they host; one can
       | do the same with WARC archives to enable a distributed archive.
       | CDX indexes can be used for rapidly querying the underlying WARC
       | archives.
       | 
       | You might support cryptographically signing WARC archives;
       | Wayback is particular about archive provenance and integrity, for
       | example.
       | 
       | https://www.loc.gov/preservation/digital/formats/fdd/fdd0005...
       | ("CDX Internet Archive Index File")
       | 
       | https://www.loc.gov/preservation/digital/formats/fdd/fdd0002...
       | ("WARC, Web ARChive file format")
       | 
       | https://github.com/internetarchive/wayback/tree/master/wayba...
       | ("Wayback CDX Server API - BETA")
        
         | nikisweeting wrote:
         | I recommend Browsertrix for WARC creation, I think they are the
         | best currently available for WARC/WACZ.
         | 
         | ArchiveBox is also gearing up to support _real_ cryptographic
         | signing of archives using https://tlsnotary.org/ in an upcoming
         | plugin. (in a way that actually solves the TLS non-repudiation
         | issue, which traditional "signing a WARC" does not, more info:
         | https://www.ndss-symposium.org/wp-
         | content/uploads/2018/02/nd...)
        
           | toomuchtodo wrote:
           | Keep in mind, what signing methodology you use is a function
           | of who accepts it. If I can confirm "ArchiveTeam ripped
           | this", that is superior to whatever tlsnotary is doing
           | with MPC, blockchain, distributed ledger, whatever (in my use
           | case). Have to trust someone at the end of the day.
           | ArchiveTeam's Warrior doesn't use tlsnotary, for example, and
           | rips entire sites just fine.
        
             | nikisweeting wrote:
             | The idea with TLSNotary is that you can have several
             | universities or central agencies running signing servers
             | but you don't have to share the cleartext content of your
             | archives with them to get it signed.
             | 
             | This dramatically changes what is possible with signing
             | because previously to get ArchiveTeam's signature of
             | approval, they would have to see the content themselves to
             | archive it. With TLSNotary they can sign without needing to
             | see the content/access the cookies/etc.
        
               | viraptor wrote:
               | Isn't that already possible with any kind of notary by
               | giving them a sha256 of the content only? Or am I missing
               | some distinction?
        
               | nikisweeting wrote:
               | You can do that but it proves nothing because TLS session
               | keys are symmetric, so the archiver can forge server
               | responses and falsely attest that the server sent them.
               | 
               | Look up "TLS non repudiation"
               | 
               | A real solution like TLSNotary involves a neutral,
               | reputable third party that can't see the cleartext
               | attesting to the cyphertext using a ZK proof.
               | 
               | The neutral third party doing attestation can't see the
               | content so they can't easily tamper with it, and attempts
               | to tamper indiscriminately would be easily detected and
               | ding their reputation.
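
The forgery problem described above can be sketched with a deliberately simplified model. This is not how TLS record protection or TLSNotary actually work internally: `session_key`, `tag`, and `verify` are hypothetical stand-ins, and an HMAC is used here only to represent "authentication with a key that both endpoints hold". The point it illustrates is that whoever holds the symmetric key can authenticate content the server never sent, and that a bare SHA-256 digest handed to a notary says nothing about who produced the content.

```python
import hashlib
import hmac

# Hypothetical stand-in for a session's symmetric key: both the server
# and the client (the archiver) hold the same secret.
session_key = b"shared-session-secret"


def tag(key: bytes, message: bytes) -> str:
    """Authenticate a message with a key that both endpoints know."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, mac: str) -> bool:
    """Check a tag in constant time."""
    return hmac.compare_digest(tag(key, message), mac)


# What the server actually sent, authenticated with the shared key.
real_response = b"<p>original post</p>"
real_tag = tag(session_key, real_response)

# The archiver holds the same key, so it can authenticate content the
# server never sent. The forged tag verifies exactly like the real one.
forged_response = b"<p>post the author never wrote</p>"
forged_tag = tag(session_key, forged_response)

# A notary that only receives a digest can attest that these bytes
# existed at some time, not that the server produced them.
digest_for_notary = hashlib.sha256(forged_response).hexdigest()

print(verify(session_key, real_response, real_tag))
print(verify(session_key, forged_response, forged_tag))
```

Both checks succeed in this model, which is why the reply above argues that an independent party has to take part in the session itself rather than sign a hash afterwards.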
        
         | pzmarzly wrote:
         | Can you recommend some tools to manage mutable torrents? I.e.
         | create them, edit them, download them and keep them downloaded
         | up to date.
         | 
         | BTW I recently tried using IPFS for a mutable public storage
         | bucket and that didn't go well - downloads were very slow
         | compared to torrents, and IPNS update propagation took ages.
         | Perhaps torrents will do the job.
        
           | Apocryphon wrote:
           | Man, looks like the first posts about IPFS cropped up on HN a
           | decade ago. I remember seeing Neocities announcement of
           | support for them. I wonder if that protocol has gotten
           | anywhere since then.
        
           | nikisweeting wrote:
           | My plan is to use a separate control plane for the
           | discovery/announcements of changes, and torrents just for the
           | data transfer. The specifics are still being actively
           | discussed, and it's a few releases away anyway.
        
         | 0cf8612b2e1e wrote:
         | > The Internet Archive serves a torrent file for every item they
         | host
         | 
         | I had no idea. I have found the IA serving speed to be pretty
         | terrible. Are the torrents any better? Presumably the only ones
         | seeding the files are IA themselves.
        
           | toomuchtodo wrote:
           | The benefit is not in seeding speed directly from IA, but the
           | potential for distributed access and seeding of the item.
           | Think of it as a filename of a zip file in a flat distributed
           | filesystem, with the ability to cherrypick files that make up
           | the item out via traditional bittorrent mechanisms. Anyone
           | can consume each item via torrent, continue to seed, and then
           | also access the underlying data. IA acts as the storage
           | system of last resort (and the metadata index).
        
           | pabs3 wrote:
           | The torrents have better speeds because they have WebSeeds
           | for multiple IA servers, so you can download from multiple
           | servers at once.
        
       | treyd wrote:
       | Is this a project that could be developed to support a
       | distributed mirror of archive.org similar to how Anna's Archive
       | works?
        
         | nikisweeting wrote:
         | Yeah that's what we're aiming for eventually, but with the
         | addition of fine-grained permissions controls so you don't
         | _have_ to share everything 100% publicly, you can choose a
         | subset.
         | 
         | https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap
        
       | nfriedly wrote:
       | I've been using an instance of https://readeck.org/ for personal
       | archives of web pages and I really like it, but I might try out
       | ArchiveBox at some point too.
       | 
       | I also run an instance of ArchiveTeam Warrior which is constantly
       | uploading things to archive.org, and I like the direction
       | ArchiveBox is heading with the distributed/federated archiving on
       | the roadmap, so I may end up setting up an instance like that
       | even if I don't use it for personal content.
        
         | nikisweeting wrote:
         | I love ArchiveTeam warrior, it's such a good idea! We run
         | several instances ourselves, and it's part of our Good Karma
         | Kit for computers with spare capacity:
         | https://github.com/ArchiveBox/good-karma-kit
         | 
         | There are a bunch of other alternatives like Readeck listed on
         | our wiki too, we encourage people to check it out!
         | 
         | https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...
        
         | venusenvy47 wrote:
         | I've been using the Single File extension to save self-
         | contained html files of pages I want to keep for posterity. I
         | like it because any browser can open the files it creates. Is
         | it easy to view the archive files from readeck? I haven't
         | looked at fancier alternatives to my existing solution.
         | 
         | https://addons.mozilla.org/en-US/firefox/addon/single-file/
        
           | nfriedly wrote:
           | I haven't looked at the on-disk format, I just use the
           | browser interface. (It's fairly common for me to save
           | something from my phone that I'll want to review on a
           | computer later.)
           | 
           | Here's an example of an Amazon "review" I recently archived
           | that has instructions for using a USB tester I have:
           | https://readeck.home.nfriedly.com/@b/tCngVjkSFOrCbwb9DnY2yw
           | 
           | And, for comparison, here's the original:
           | https://www.amazon.com/gp/customer-reviews/R3EF0QW6MAJ0VP
           | 
           | It'd be nice if I could edit out the extra junk near the top,
           | but the important bits are all there.
        
             | ashildr wrote:
             | I was about to post a link to the same URL but archived
             | using singleFile, which looks like the original at amazon.
             | I didn't because I realized that I have absolutely no idea
             | what additional information would be hidden in the file. In
             | the worst case any component sent by Amazon and archived
             | into the file may contain PII, even if I am "logged out".
             | 
             | I'm not saying that singleFile is bad in any way, I'm using
             | it a lot on multiple devices, but I'm not sure whether
             | sharing archives is a good idea(tm).
        
               | nikisweeting wrote:
               | 100%, this is the challenge of archiving logged in
               | content.
               | 
               | It becomes un-shareable unless we use fake burner
               | accounts for capture, or have really good sanitizing
               | methods.
        
               | ashildr wrote:
               | Even when I'm logged out I expect at least information on
               | my geographical location to seep into the archive via
               | URLs addressing specific CDN endpoints or similar
               | mechanisms.
        
               | nikisweeting wrote:
               | Yup, this is why the ArchiveBox browser extension sends
               | URLs to a separate server for archiving with an isolated
               | burner profile.
               | 
               | I should write a full article on the security
               | implications at some point, there aren't many good top-
               | down explanations of why this is a hard problem.
        
           | nikisweeting wrote:
           | Singlefile is excellent, Gildas is a great developer.
           | ArchiveBox has had singlefile as one of its extractors built
           | in for years :)
        
       | wongarsu wrote:
       | Does this mean it's now possible to write plugins that dismiss
       | cookie popups, solve captchas, scroll web pages etc.?
        
         | nikisweeting wrote:
         | I have a private plugin with puppeteer support for stuff like
         | this, currently charging clients money to use it to fund the
         | open source development. The clients are people who are already
         | legally allowed to evade CAPTCHAS (e.g. governments, NGOs doing
         | research, lawyers collecting evidence, etc.)
         | 
         | Unfortunately I can't open source the CAPTCHA solving stuff
         | myself, because it opens me up to liability, but if someone
         | wants to contribute a plugin to the ecosystem I can't stop them
         | ;).
        
           | 0x1ch wrote:
           | Legally allowed to evade CAPTCHAs? LOL.
           | 
           | What world do we live in where evading a captcha is an
           | illegal offense?
        
             | nikisweeting wrote:
             | It doesn't matter whether or not it's actually legal, what
             | matters is that the big platforms will sue you for trying,
             | so you need a big bankroll to stand your ground.
             | 
             | At the very least they can bar you from accessing their
             | sites as you're violating ToS that you accept upon signup.
        
       | sagz wrote:
       | Do y'all support archiving pages that are behind logins? Like
       | using browser cookies?
        
         | markerz wrote:
         | Yes, but there are security concerns where you might accidentally
         | leak your credentials / cookies if you publish your archive to
         | the public.
         | 
         | https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...
         | 
         | PS. I'm an archivebox user, not a dev or maintainer.
        
           | nikisweeting wrote:
           | Yes this is correct, with plans to make this easier in the
           | near future via setup wizard that guides you through creating
           | dedicated credentials for archiving.
        
       | millvalleydev wrote:
       | For devs like us, archivebox? or browsertrix-crawler? for
       | scraping entire sites for our own uses, maybe to keep contents
       | behind pay walls while we have subscriptions or maybe to feed
       | them to local LLMs to ask?
        
         | nikisweeting wrote:
         | For scraping entire sites Browsertrix is currently more suited
         | until we add full depth recursive crawling in v0.9. For feeding
         | to LLMs ArchiveBox MIGHT BE better (imho) because we extract
         | the raw content and you likely don't need the whole WARC.
        
       | bravura wrote:
       | @nikisweeting ArchiveBox is awesome and we'd really love it to be
       | more awesome. And sustainable!
       | 
       | I've posted issues and PRs for showstopper issues that took
       | months to get merged in:
       | https://github.com/ArchiveBox/ArchiveBox/issues/991
       | https://github.com/ArchiveBox/ArchiveBox/pull/1026
       | 
       | You have the opportunity for the community to lean in on
       | ArchiveBox. I understand it's hard to do everything as a solo
       | dev, we've seen many cases in the community where solo devs get
       | burned out or have personal challenges that take priority etc.
       | 
       | It's hard for us users to lean in on ArchiveBox when after a
       | happy month of archiving, things start to break and you're left with
       | maintaining a branch of your own fixes that aren't in main.
       | Meanwhile, your solution of soliciting one time donations just
       | makes the whole project feel more rickety and fly-by-night. How
       | about thinking bigger?
       | 
       | We NEED ArchiveBox to be a real thing. Decentralized tooling for
       | archiving is SO IMPORTANT. I care about it and I suspect many
       | people do. I'm posting this so other people who care about it can
       | also comment and chime in and suggest how it can become something
       | we can rely on. Because archiving isn't just about the past, it's
       | about the future.
       | 
       | Maybe it needs to be a dev org of three committed part-time
       | maintainers, and a small foundation that people recurrently
       | support is what grants it? IDK. I'm not an expert at how to make
       | open source resilient. There have been discussions about this in
       | the past, but I think it's worth a serious look because
       | ArchiveBox is IMPORTANT and I want it to work any month I decide
       | to re-activate my interest in it. I invite people to discuss ways
       | to make this valuable project more sustainable and resilient.
        
         | nikisweeting wrote:
         | Let's chat more. I'm almost ready to raise some seed money, hire
         | a second staff dev or find a cofounder, and I'm looking for
         | people that care deeply about the space.
         | 
         | It's only been during the last few months that I decided to go
         | all in on the project, so this is still just the first few
         | pages of a new chapter in the project's history.
         | 
         | (I should also mention that if you're a commercial entity
         | relying on ArchiveBox, you can hire us for dedicated support
         | and uptime guarantees. We have a closed source fork that has a
         | much better test suite and lots of other goodies)
        
           | giancarlostoro wrote:
           | Do you guys have a Discord by chance? I have a close friend
           | who is insanely passionate about archiving, he has a personal
           | instance of archivebox, and is working on a Video Downloading
           | project as well. He has used it almost every day and archived
           | thousands of news articles over years. He's aware of a lot of
           | the nuances.
        
             | nikisweeting wrote:
             | We have a Zulip which is similar to discord (but self
             | hosted and it has better threading):
             | https://zulip.archivebox.io
        
           | nyx wrote:
           | It looks like you're doing great work here, thanks a bunch;
           | looking forward to seeing this project develop.
           | 
           | Selling custom integrations, managed instances, white-glove
           | support with an SLA, and so on seems like a reasonable
           | funding model for a project based on an open-source, self-
           | hostable platform. But I'm a little disheartened to read that
           | you're maintaining a closed fork with "goodies" in it.
           | 
           | How do you decide which features (better test suite?) end up
           | in the non-libre, payware fork of your software? If someone
           | contributed a feature to the open-source version that already
           | exists in the payware version, would you allow it to be
           | merged or would you refuse the pull request?
        
             | nikisweeting wrote:
             | The idea with the plugin system is that plugins are just
             | git repos containing <pluginname>/__init__.py, and you can
             | add any set of git repo plugins you want to your instance.
             | 
             | The marketplace will work by showing all git repos tagged
             | with the "archivebox" tag on github.
             | 
             | My approval is only needed for PRs to the archivebox core
             | engine.
             | 
             | More info on free vs paid + reasoning why it's not all open
             | source: https://news.ycombinator.com/item?id=41863539
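
The plugin layout described above (a directory per plugin containing `<pluginname>/__init__.py`) can be sketched as a simple directory scan. This is only an illustration under assumptions, not ArchiveBox's real loader: `discover_plugins` and the directory names in the demo are hypothetical, and how plugins are actually cloned, registered, or imported is not covered in the thread.

```python
import tempfile
from pathlib import Path


def discover_plugins(plugins_dir: Path) -> list[str]:
    """Return the names of subdirectories that contain an __init__.py."""
    if not plugins_dir.is_dir():
        return []
    return sorted(
        child.name
        for child in plugins_dir.iterdir()
        if child.is_dir() and (child / "__init__.py").is_file()
    )


# Demo with a throwaway directory layout (all names are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "example_extractor").mkdir()
    (root / "example_extractor" / "__init__.py").write_text("")
    (root / "not_a_plugin").mkdir()  # no __init__.py, so it is skipped
    found = discover_plugins(root)

print(found)
```

Treating "a folder with an `__init__.py`" as the only requirement keeps the convention close to an ordinary Python package, which fits the idea of installing plugins by adding git repos to an instance.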
        
           | manofmanysmiles wrote:
           | I love this project. I "independently" "invented" it in my
           | head the other day, and happy to see it already exists!
           | 
           | I'd love to see blockchain proof/notary support. The ability
           | to say "content matching this hash existed at this time."
           | 
           | I'm exceptionally busy now but that being said, I may choose
           | to contribute nonetheless.
           | 
           | I'd love to connect directly, and will connect to the Zulip
           | instance later.
           | 
           | If we align on values, I may be able to connect you with some
           | cash. People often call me an "anarchist" or "libertarian",
           | though I'm just me, no labels necessary.
        
           | bigiain wrote:
           | "I too would like commit access to your promising looking
           | project's git repo and CI/CD pipeline. Thanks, Jia Tan"
        
       | 404mm wrote:
       | Somewhat similar topic: does anyone have recommendations for a
       | self-hosted internet website change monitoring system? I've been
       | running Huginn for many years and it works well; however, I have
       | a feeling the project is on its last leg. Also, it's based on
       | text scraping (XPath/CSS/HTML) and RSS, but it struggles with
       | newer JS-based sites.
        
         | nikisweeting wrote:
         | Changedetection.io
        
           | 404mm wrote:
           | Thank you! That looks great!
        
         | arminiusreturns wrote:
         | Why do you feel like Huginn is on its last leg? It's been in my
         | list of things to play with for years now, but I never got
         | around to it...
        
           | 404mm wrote:
           | It looks like it's being maintained by a single remaining
           | developer. No new features are being added, just some basic
           | maintenance. The product as a whole still works well, so
           | unless you find something better, I do recommend it. I run it
           | in k3s and the image is probably the easiest way of
           | maintaining it.
        
       | Acrobatic_Road wrote:
       | The subline mentions "Auto-login", but the article never
       | elaborates on this. Does this mean we will be able to more easily
       | archive non-public websites?
       | 
       | Also, how do you plan to ensure data authenticity across a
       | distributed archive? For example, if I archive someone's blog,
       | what is stopping me from inserting inflammatory posts that they
       | never wrote, and passing them off as the real deal? Slight
       | update: I see you're using TLS Notary! That's exactly what I
       | would have suggested!
        
         | nikisweeting wrote:
         | Auto log in is currently a service I provide for paying
         | clients, and you can do it in the open source version manually
         | with some extra config.
         | 
         | Working hard on making it more accessible in the future, and
         | plugins should help!
        
       | FiniteField wrote:
       | Disappointing that a project that should ostensibly care about
       | preserving the open, non-centralised internet takes the time to
       | namedrop and talk about making "compromises" against preserving a
       | well-known, medium-sized clearnet forum legally operated from a
       | US-based LLC. Still-living independent forum sites in this day
       | and age have unrivalled SNR of actual human-to-human
       | communication, there should be no better candidate for archival.
       | It's sad that a self-hosted archival tool has to apologise for
       | any "evil" content it might be used for in the first place. Tape
       | recorders do not require a disclaimer about people saying "hate
       | speech" into them.
        
         | nikisweeting wrote:
         | Sorry which medium sized forum are you referring to?
         | 
         | I love forums and want them to continue, I'm not sure where you
         | got the idea that I dislike them as a medium. I was just
         | pointing out that public sites in general have started to see
         | some attrition a bit lately for a variety of reasons, and the
         | tooling needs to keep up with new mediums as they appear.
         | 
         | I also make no apology for the content, in fact ArchiveBox is
         | explicitly designed to archive the most vile stuff for lawyers
         | and governments to use for long term storage or evidence
         | collection. One of our first prospective clients was the UN
         | wanting to use it to document Syrian war crimes. The point
         | there was that we can save stuff without amplifying it, and
         | that's sometimes useful in niche scenarios.
         | 
         | Lawyers/LE especially don't want to broadcast to the world (or
         | tip off their suspect) that they are investigating or endorsing
         | a particular person, so the ability to capture without publicly
         | announcing/mirroring every capture is vital.
        
           | dark-star wrote:
           | I guess he's talking about K_wi F_rms which was mentioned in
           | one of the screenshots...
        
       | rodolphoarruda wrote:
       | > "In an era where fear of public scrutiny is very tangible,
       | people are afraid of archiving things for eternity. As a result,
       | people choose not to archive at all, effectively erasing that
       | history forever."
       | 
       | Really? I don't get that feeling at all. I use Evernote to
       | archive anything I consider worth keeping. I wonder where such
       | "fear of archiving" comes from.
        
         | nikisweeting wrote:
         | A lot of people are retreating off public free-for-all
         | platforms like Twitter to more siloed spaces like Discord, for
         | many reasons, not just fear of archiving.
         | 
         | It all has the same effect of making it harder to archive
         | though.
        
       | rcarmo wrote:
       | This is nice. I'm actually much more excited about the REST API
       | (which will let me do searches and pull information out, I hope)
       | than the plugin ecosystem, since the last thing I need is for
       | another tool to have a half-baked LLM integration -- I prefer to
       | do that myself and have full control.
       | 
       | Being able to do RAG on my ArchiveBox is something that I have
       | very much wanted to do for over a year now, and it might finally
       | be within reach without my going and hacking at the archived
       | content tree...
       | 
       | Edit: Just looked at the API schema at
       | https://demo.archivebox.io/api/v1/docs.
       | 
       | No dedicated search endpoint? This looks like a HUGE missed
       | opportunity. I was hoping to be able to query an FTS index on the
       | SQLite database... Have I missed something?
        
         | nikisweeting wrote:
         | The /cli/list endpoint is the search endpoint you're looking
         | for. It provides FTS but I can make it clearer in the docs,
         | thanks for the tip.
         | 
         | As for the AI stuff don't worry, none of it is touching core,
         | it's all in an optional community plugin only for those who
         | want it.
         | 
         | I'm not personally a huge AI person but I have clients who are
         | already using it and getting massive value from it, so it's
         | worth mentioning. (They're doing some automated QA on thousands
         | of collected captures and feeding results into spreadsheets)
        
           | rcarmo wrote:
           | Thanks, I'll have a look.
           | 
           | My use for this is very different--I want to be able to use a
           | specific subset of my archived pages (which is mostly
           | reference documentation) to "chat" with, providing different
           | LLM prompts depending on subset and fetching plaintext chunks
           | as reference info for the LLM to summarize (and point me back
           | to the archived pages if I need more info).
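
The retrieval step described above (pick plaintext chunks from a subset
of archived pages, keep a pointer back to the source page) can be
sketched roughly as follows. This is a minimal illustration, not part
of ArchiveBox: the names Chunk, chunk_text and top_chunks are
hypothetical, the page text is assumed to have been extracted already,
and plain keyword overlap stands in for whatever ranking or embedding
model would be used in practice.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # path or URL of the archived page (hypothetical field)
    text: str    # plaintext slice of that page


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_text(source: str, text: str, size: int = 40, overlap: int = 10) -> list[Chunk]:
    """Split text into word windows of `size`, overlapping by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(Chunk(source, " ".join(words[start:start + size])))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def top_chunks(query: str, chunks: list[Chunk], k: int = 3) -> list[Chunk]:
    """Rank chunks by how many distinct query tokens they contain."""
    wanted = set(tokenize(query))
    scored = [(len(wanted & set(tokenize(c.text))), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


if __name__ == "__main__":
    demo = chunk_text("docs/example.html", "some archived reference text " * 30)
    for hit in top_chunks("reference text", demo, k=2):
        print(hit.source, "->", hit.text[:50])
```

The chunks returned by top_chunks would then be pasted into the LLM
prompt, with each chunk's source kept alongside so the answer can point
back to the archived page.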
        
             | nikisweeting wrote:
             | Ok that makes sense, I think archivebox works as the first
             | step in a pipeline there, with some other tool doing the
             | LLM analysis and query stuff.
        
           | sunshine-o wrote:
           | I have been using ArchiveBox recently and love it.
           | 
           | About search, one thing I haven't yet figured out how to do
           | easily is to plug it into my SearXNG instance, as they only
           | seem to support Elasticsearch, Meilisearch or Solr [0]
           | 
           | So this new plugin architecture will allow for a meilisearch
           | plugin I guess (with relevancy ranking).
           | 
           | - [0] https://docs.searxng.org/dev/engines/offline/search-indexer-...
        
             | nikisweeting wrote:
             | Definitely doable! Search plugins are one of the first that
             | I implemented.
             | 
             | We already provide Sonic, ripgrep, and SQLiteFTS as
             | plugins, so adding something like Solr should be
             | straightforward.
             | 
             | Check out the existing plugins to see how it's done:
             | https://github.com/ArchiveBox/ArchiveBox/pull/1534/files?fil...
             | 
             | archivebox/plugins_search/sonic/*
        
       | favorited wrote:
       | As someone who was archiving a doomed website earlier today using
       | wget, I was reminded that I really need to get ArchiveBox
       | working...
       | 
       | I used to rely on my Pinboard subscription, but apparently
       | archive exports haven't worked for years, so those days are over.
        
         | nikisweeting wrote:
         | Pocket also doesn't offer archived page exports (or even RSS
         | export). I feel like both are really dropping the ball in this
         | area!
        
         | VTimofeenko wrote:
         | I recently found omnivore.app through HN comments -- works
         | great for sharing a reading list across machines. I am
         | exporting articles through obsidian, but there is an API
         | option. I don't think it supports outbound RSS, but they have
         | inbound RSS (i.e. omnivore as RSS reader) in beta.
        
       | orblivion wrote:
       | Have you (and I wonder the same about archive.org) considered
       | making a Merkle tree of the data that gets archived? Since data
       | (including photos and videos) are getting easier to fake, it may
       | be nice to have a provable record that at least a certain version
       | of the data existed at a certain time. It would be most useful in
       | case of some sort of oppressive regime down the line that wants
       | to edit history. You'd want to publish the tip somewhere that
       | records the time, and a blockchain seems to make the most sense
       | to me but maybe you don't like blockchains.
        
         | nikisweeting wrote:
         | Yup, already doing that in the betas. That's what I'm referring
         | to as the beginnings of a "content addressable store" in the
         | article.
         | 
         | In the closed source fork we currently store a merkle tree
         | summary of each dir in a dotfile containing the sha256 and
         | blake3 hash of all entries / subdirs. When a result is "sealed"
         | the summary is generated, and the final salted hash can be
         | submitted to Solana or ETH or some other network to attest to
         | the time of capture and the content. (That part is coming via a
         | plugin later)
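
A rough sketch of the per-directory Merkle summary idea described
above, assuming SHA-256 only (blake3 is not in the Python standard
library) and a made-up record format of "name, NUL, child hash". The
function names are hypothetical and this is not the format the closed
source fork actually writes; it only shows why changing any file
changes the hash at the root.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path: Path) -> str:
    """SHA-256 of a file's contents, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def hash_dir(path: Path) -> str:
    """Merkle-style hash of a directory.

    Each entry contributes "name\\0child_hash\\n", where child_hash is the
    file hash for files and a recursive hash_dir() for subdirectories.
    Entries are sorted by name so the result does not depend on listing order.
    """
    digest = hashlib.sha256()
    for child in sorted(path.iterdir(), key=lambda p: p.name):
        child_hash = hash_dir(child) if child.is_dir() else hash_file(child)
        digest.update(f"{child.name}\0{child_hash}\n".encode())
    return digest.hexdigest()


def make_demo_snapshot(root: Path) -> None:
    """Create a tiny fake snapshot directory (illustrative content only)."""
    (root / "media").mkdir()
    (root / "index.html").write_text("<h1>hello</h1>")
    (root / "media" / "video.txt").write_text("pretend video bytes")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp)
        make_demo_snapshot(snapshot)
        print("root hash:", hash_dir(snapshot))
```

Only the root hash would need to be published or timestamped; anyone
holding the directory can recompute it and compare.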
        
           | orblivion wrote:
           | Wow that's great!
        
         | beefnugs wrote:
         | Not just all that nonsense, but also it makes a lot of sense to
         | share just the parts from a website that matter like a single
         | video etc without having to download an entire archive or the
         | rest of the site
        
           | nikisweeting wrote:
           | $ archivebox add --extractor=media,readability https://...
           | 
           | We try to make that easy by allowing ppl to select one or
           | more specific archivebox extractors when adding, so you don't
           | have to archive everything every time.
           | 
           | Makes it more useful for scraping in a pipeline with some
           | other tools.
        
       | petertodd wrote:
       | You really should add timestamping to ArchiveBox. The easiest way
       | to do that would be via my OpenTimestamps protocol,
       | https://opentimestamps.org It's open source and free to use, and
       | uses Bitcoin for the actual timestamps. Users of it do _not_ need
       | to make Bitcoin transactions themselves as a set of community
       | calendar servers do that for you. You also don't need a Bitcoin
       | node to create an OTS timestamp, and you can validate an OTS
       | timestamp without a Bitcoin node as well by trusting someone else
       | to do that for you.
       | 
       | The big thing that ArchiveBox can't do, and the Internet Archive
       | can, is attest to the accuracy of the archive. Being at least
       | able to prove that the archive was created in the past, prior to
       | there being a reason to tamper it, is the best we can
       | realistically do with current cryptography. So it'd be really
       | good if support for timestamping was added.
       | 
       | IIUC ArchiveBox is written in Python; OTS has a Python library
       | that should work fine for you:
       | https://github.com/opentimestamps/python-opentimestamps
        
         | nikisweeting wrote:
         | We're going to add TLSNotary support for real cryptographic
         | signing, see my comments below :)
         | 
         | Timestamping is also on my roadmap, definitely as a plugin (and
         | likely paid) as it's more corporate users that really need it.
         | We need to keep some of the really advanced attestation
         | features paid to be able to support the rest of the business.
        
           | mikae1 wrote:
           | Thanks for the box!
           | 
           | Any examples of other possible really advanced features that
           | might go for-pay?
           | 
           | Is there any chance you will make current free features for-
           | pay? That'd be rather off-putting for me as a home user.
        
             | nikisweeting wrote:
             | No, everything currently free will stay free.
             | 
             | The paid stuff currently is:
             | 
             | - per-user permissions & groups
             | 
             | - audit logging
             | 
             | - auto CAPTCHA solving
             | 
             | - burner credential management for FB/Insta/Twitter/etc. w/
             | auto phone based account verification ability
             | 
             | - custom JS scripts for expanding comments, hiding pop ups,
             | etc.
             | 
             | - managed hosting + support
             | 
             | Some of this stuff ^ is going to become free in upcoming
             | releases, some will stay paid. What I decide to make free
             | is mostly based on abuse potential and legal ramifications,
             | I'd rather have a say in how the risky stuff is used so
             | that it doesn't become a tool weaponized for botting.
        
         | jasonfarnon wrote:
         | I always wonder about this when someone gets in hot water based
         | on something on the wayback machine and the person says the
         | archive was tampered with. Can you elaborate on "prove that the
         | archive was created in the past, prior to there being a reason
         | to tamper it"? What exactly does opentimestamps certify?
        
           | nikisweeting wrote:
           | OpenTimestamps alone can not currently prove anything because
           | TLS session keys are symmetric. The client can forge anything
           | and attest to it falsely. Unless you 100% trust the archiver
           | (in which case you can trust their timestamps), you need
           | TLSNotary or another reputable third party in the loop as a
           | bare minimum.
           | 
           | But more critically: currently the legal standard for
           | evidence is... screenshots. We have a lot of educating work
           | to do before the public understands the value of attestation
           | and signing.
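
The point about symmetric session keys can be illustrated with a
deliberately simplified model. Real TLS protects records with AEAD
ciphers and per-direction keys, and this is not TLS code; HMAC is used
here only as a stand-in for "authentication with a key both sides
hold". The session_key, messages and helper names are all made up for
the example.

```python
import hashlib
import hmac
import os


def mac(key: bytes, message: bytes) -> bytes:
    """Authentication tag computed with a shared (symmetric) key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Constant-time check that `tag` matches `message` under `key`."""
    return hmac.compare_digest(mac(key, message), tag)


# Both the server and the archiving client know this key after the handshake.
session_key = os.urandom(32)

# What the server really sent, with the tag the server computed.
genuine = b"<p>What the server actually sent</p>"
server_tag = mac(session_key, genuine)

# The client holds the same key, so it can tag content the server never sent.
forged = b"<p>Something the server never sent</p>"
forged_tag = mac(session_key, forged)

print("genuine verifies:", verify(session_key, genuine, server_tag))
print("forged verifies: ", verify(session_key, forged, forged_tag))
```

Both checks pass, so a third party looking only at the tags cannot tell
which message came from the server. A timestamp on the forged capture
would prove when it was made, not that it is authentic, which is why a
notary or other third party has to be involved during the session.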
        
       | dark-star wrote:
       | Some time ago I installed ArchiveBox on a RaspberryPi 4 running
       | k3s (a lightweight Kubernetes distro).
       | 
       | I have documented that here:
       | https://darkstar.github.io/2022/02/07/k3s-on-raspberrypi-at-...
       | 
       | Note that this was a rather old version and some things have
       | probably changed compared to now, so YMMV, but it might still
       | provide a good reference for those who want to try
        
         | nikisweeting wrote:
         | Thanks for making that tutorial!
         | 
         | Happy to report that most of the quirks you cover have been
         | improved:
         | 
         | - uid 999 is no longer just enforced, you can pass any
         | PUID:PGID now (like Linuxserver.io containers)
         | 
         | - it now accepts ADMIN_USERNAME + ADMIN_PASSWORD env vars to
         | create an initial admin user on first start without having to
         | exec
         | 
         | - archivebox/archivebox:latest is 0.7.2 (yearly stable release)
         | and :dev is the 0.8.x pre-release updated daily. All images are
         | amd64 & arm64 compatible.
         | 
         | - singlefile and sonic are now included in all images &
         | available on all platforms amd64/arm64
        
           | dark-star wrote:
           | yeah I really need to update that guide. Since I published it
           | I have updated ArchiveBox locally to a newer version but
           | never bothered to update the guide :)
        
       | A4ET8a8uTh0 wrote:
       | Those additions are welcome, but if I could request one feature
       | -- I will note that it is very consistently requested:
       | 
       | - backing up an entire page
       | 
       | Yes, it is hard. Yes, for non-pure HTML pages it is an extra kind
       | of painful, but that would honestly make archivebox go from nice
       | to have to.. yes, I have an actual archive I can use when stuff
       | goes down.
        
       | bityard wrote:
       | So, after reading through the comments and website, I just
       | realized I used ArchiveBox a month or two ago for a very specific
       | purpose.
       | 
       | You see, I inherited a boat.
       | 
       | This boat belonged to my father. He was not materialistic but he
       | took very good care of the things he cared about, and he cared
       | about this boat. It's an old 18' aluminum fishing/cruising boat
       | built in the early 1960's. It's not particularly valuable as a
       | collectible but it is fairly rare and has some unique
       | modifications. I spent a lot of time trying to dig up all of the
       | info that I could on it, but this is one of those situations
       | where most of the companies involved have been gone for decades
       | and most everyone who was around when these were made are either
       | dead or not really on the Internet.
       | 
       | It's a shame that I waited so long to start my research because
       | 10 or 20 years ago, there were quite a few active web forums
       | containing informational/tutorial threads from the proud owners
       | of these old boats. I know because I have seen references to
       | them. Some of the URLs are in archive.org, some are not. But the
       | forums are gone, so a large chunk of knowledge on these boats is
       | too, probably forever.
       | 
       | I did manage to dig up some interesting articles, pictures, and
       | forum threads and needed a way to save them so that they didn't
       | disappear from the web as well. There is probably an easier way
       | to go about it, but in the end I ran ArchiveBox via Docker and
       | set it to fetching what I could find and then downloaded the
       | resulting pages as self-contained HTML pages.
        
       | pabs3 wrote:
       | Unfortunately ArchiveBox uses wget, so it produces non-standard
       | WARC files. Sadly there are lots of things like this in the WARC
       | ecosystem.
       | 
       | https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
        
         | nikisweeting wrote:
         | Yes, this is true currently. If you need nice WARCs I recommend
         | Browsertrix by our friends at Webrecorder instead.
         | 
         | It's on my roadmap to improve this eventually, but currently I'm
         | focused on saving raw files to a filesystem, because it's more
         | accessible to most users, and easier to pipe into other tools.
         | 
         | I encourage people to use ZFS to do deduping and compression at
         | the filesystem layer.
        
       ___________________________________________________________________
       (page generated 2024-10-16 23:00 UTC)