[HN Gopher] Wayback: Self-hosted archiving service integrated wi...
___________________________________________________________________
Wayback: Self-hosted archiving service integrated with Internet
Archive
Author : thunderbong
Score : 217 points
Date : 2023-04-16 02:56 UTC (20 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| JamesAdir wrote:
| Any recommendation for a tool that can crawl and download an
| entire website completely and save it locally?
| thunderbong wrote:
| HTTrack Website Copier
|
| https://www.httrack.com/
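|
| Basic CLI usage is something like this (the URL, output
| directory and filter are just placeholders):
|
|     httrack "https://example.com/" -O ./example-mirror \
|       "+*.example.com/*"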
| JamesAdir wrote:
| thanks, tried it, but it has a problem with Unicode URLs
| unless there is some setting I'm missing.
| gala8y wrote:
| try adding +*{unicode} to the scan rules. seriously, please try
| it and let me know if it works.
| arboles wrote:
| Browsertrix or Brozzler, which crawl using headless browsers
| for accuracy.
| [deleted]
| 9dev wrote:
| > Wayback is a tool that supports running as a command-line tool
| and docker container, purpose to snapshot webpage to time
| capsules.
|
| > Supported Golang version: See .github/workflows/testing.yml
|
| The summary is hilarious! I still have not the slightest idea
| what it does, why I should care, or what it's good for. What are
| "time capsules", and what does "snapshot" mean in this context?
| joshspankit wrote:
| It seems to make the assumption that people already know how
| the Internet Archive's "wayback" service works, or at least
| what its purpose is.
|
| The essence is that web pages change over time and can be taken
| down or otherwise lost. A "snapshot" is a capture of a webpage at
| a specific time, and I assume that "time capsule" is some type
| of format that holds the snapshot as well as the extra
| metadata. The result is something that can be used later to see
| that website as it was.
| squarefoot wrote:
| Not the best explanation for sure. It seems to be a tool that
| can be used to offload and potentially decentralize some
| archiving work from the Internet Archive, with some
| privacy/anonymity added to help against censorship. About time,
| I'd say; however, I'm not sure whether it can be used for bare
| files as well: books, retrogaming, software, etc.
| angelmm wrote:
| I love all these kinds of projects, as I tend to be paranoid
| about losing good online content.
|
| It's also unclear to me how Wayback works. It seems more like an
| API than a self-hosted service.
|
| I'm currently using ArchiveBox [0], which provides a complete API
| + UI.
|
| - [0] https://archivebox.io/
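|
| If anyone wants to try it, the basic flow is roughly this (from
| memory, so double-check against their docs; the port is just an
| example):
|
|     pip install archivebox
|     mkdir ~/archive && cd ~/archive
|     archivebox init
|     archivebox add 'https://example.com'
|     archivebox server 0.0.0.0:8000    # web UI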
| rzzzt wrote:
| Are you using all extractors when saving a page?
|
| I tried ArchiveBox and Shiori, but neither stuck for some
| reason. The latter is a bit more lightweight; it can save the
| entire page as well as a Readability-based conversion:
| https://github.com/go-shiori/shiori/
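|
| Shiori's CLI is pleasantly small; from what I remember it's
| roughly:
|
|     shiori add https://example.com
|     shiori serve    # web UI, on :8080 by default I believe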
| Linux-Fan wrote:
| I am not angelmm, but another happy ArchiveBox user.
|
| My choice of extractors is the following: Singlefile, PDF,
| Screenshot, archive.org.
|
| I found the largest issue with any website archiving tool to
| be the discrepancy between what I see in my web browser and
| what is saved. The most "reliable" way that still works today
| for me seems to be the "Save Page WE" Firefox plugin.
|
| I have a sidecar container running that checks for HTML files
| appearing in a directory, triggers the archivebox save and
| then overwrites the "singlefile" capture with the provided
| HTML file. This way, I can trigger archiving by just using
| the Save Page WE plugin and storing the resulting HTML file
| in the directory.
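|
| The watcher itself is a small shell loop, roughly like this
| (simplified; the paths, the URL extraction and the
| "singlefile.html" name depend on your setup and ArchiveBox
| version):
|
|     #!/bin/sh
|     # watch the drop directory for finished Save Page WE exports
|     inotifywait -m -e close_write --format '%w%f' /data/inbox |
|     while read -r html; do
|       # hacky: take the first absolute URL found in the saved page
|       url=$(grep -oE 'https?://[^" ]+' "$html" | head -n 1)
|       archivebox add "$url"
|       # replace the freshly created singlefile capture with the
|       # browser-rendered copy
|       snap=$(ls -td /data/archive/*/ | head -n 1)
|       cp "$html" "${snap}singlefile.html"
|     done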
| zerkten wrote:
| Can archive box aggregate content "bookmarked" in different
| places?
|
| I want a tool that will pull saved items on Reddit,
| favorite posts on HN, etc. in addition to bookmarks posted
| to pinboard.in to a single place. In many ways, the share
| functionality in iOS allows me to get all URLs into a
| single place, but this doesn't help on desktop.
|
| I know this isn't necessarily an easy task if APIs aren't
| available. I'd be OK with a client component or browser
| extension, if it was open source and self-hostable.
| Linux-Fan wrote:
| AFAIK it does not contain any function to support this
| directly.
|
| My primary way to interact with Archive Box for such
| purposes is by calling it on the command line. Scripts
| may be used to obtain the URLs of interest from any
| source.
|
| When I started with Archive Box I had some existing
| downloaded websites from ScrapBook and Save Page WE
| already. I used some hacky scripts to extract the URLs
| from the respective pages and overwrite Archive Box's
| downloaded copies with my original copies, so as to make
| it work for pages that had been deleted in the meantime.
| All my data sources were local/desktop though.
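|
| For pinboard.in specifically, something along these lines
| works, since archivebox accepts a plain list of URLs on stdin
| (the auth token is a placeholder):
|
|     TOKEN='USER:XXXX'   # your pinboard API token
|     curl -s "https://api.pinboard.in/v1/posts/all?format=json&auth_token=$TOKEN" \
|       | jq -r '.[].href' \
|       | archivebox add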
| Handprint4469 wrote:
| ArchiveBox also saves a Readability version:
|
| > Article Text: article.html/json Article text extraction
| using Readability & Mercury [0]
|
| [0]: https://archivebox.io/#output-formats
| unintendedcons wrote:
| For archiving, look into https://github.com/dosyago/DiskerNet
|
| It's real next gen thinking on this topic.
|
| As for the featured tool wayback... if HN readers can't figure
| out what it does after reading the docs, it's likely the
| thinking behind it is equally unclear.
| mellosouls wrote:
| Does it actually record web archives (i.e. WARCs)? I couldn't
| work out from a quick look at the repo whether it does that or
| not, though it claims to make your archives shareable.
|
| There's a long-existing web archiving ecosystem with
| established formats for recording and publishing archives.
|
| The linked repo could probably do with some clarity itself on
| how it does or doesn't fit in with those standards.
| uniqueuid wrote:
| diskernet is certainly interesting because it records archives
| as you browse.
|
| But it's not open source, and it's pretty limited in terms of
| use cases.
|
| The thing is, archiving is a multi-faceted and hard problem
| (e.g. video content, live streams, interactive sites), and will
| remain so. A complicated task leads to complicated tools.
| jsiepkes wrote:
| DiskerNet looks cool but is apparently commercial software?
| Even for personal use you need a license I think?
| gala8y wrote:
| Looking at the link you gave does not help much in seeing what
| DiskerNet does or looks like, either.
|
| Keeping it simple, I download pages as Markdown, adding some
| metadata (some tags). When I want images or more, I use the
| SingleFile extension. Add Recoll to the mix and that's all I
| need.
|
| https://github.com/deathau/markdownload
|
| https://github.com/gildas-lormeau/SingleFile
|
| https://www.lesbonscomptes.com/recoll/pages/index-recoll.htm...
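|
| Recoll just needs to be pointed at the notes directory; I do
| roughly this (~/notes is wherever the Markdown lands):
|
|     # (re)index everything configured in ~/.recoll/recoll.conf
|     recollindex
|
|     # or index a single freshly saved file
|     recollindex -i ~/notes/some-page.md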
| kuschkufan wrote:
| What about videos (embedded or not)?
| gala8y wrote:
| yt-dlp is good enough for most cases for me.
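|
| Usually nothing fancier than this (the output template is just
| my preference):
|
|     yt-dlp -o '%(title)s [%(id)s].%(ext)s' 'https://example.com/some-video'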
| pdimitar wrote:
| Only 3 matches of the word "local" in the README and they don't
| seem to refer to any self-hosting whatsoever.
|
| What does this tool do exactly? Send URLs to remote services for
| them to do snapshots?
|
| That README is a classic case of filter bubble syndrome.
| People, it can't physically hurt you to add 1-2 sentences
| saying what _exactly_ your tool does!
| pmontra wrote:
| It also stores locally. One of the last bullet points is
|
| > Supports storing archived files to disk for offline use
| pdimitar wrote:
| Ah, I see it now, thanks. IMO it should be front and center,
| right at the start of the project's description.
| polynox wrote:
| I would not complain about a free open source tool just because
| its authors don't do a good job of marketing it.
| pdimitar wrote:
| They can do whatever they want, obviously.
|
| But since it's open source they're likely aiming for more
| developer mindshare. To have that they should make their
| message crystal clear.
| swyx wrote:
| but it would be nice if they spent a bit of time documenting
| themselves a bit more, same as you would want from any
| coworker at a big company. this isn't just marketing, it's
| good developer hygiene.
| x0x0 wrote:
| The link that's kind of buried in the upper right explains it:
|
| https://docs.wabarc.eu.org/
| baq wrote:
| Perhaps this should be integrated into browsers? Bonus points for
| automatic push to the Wayback Machine in a privacy-preserving
| manner...
| klysm wrote:
| Seems hard to determine if a page has personal content. The
| archiving should probably only be done on clients that don't
| have credentials.
| mattrighetti wrote:
| When I initially saw the project I thought it was some kind of
| web archive running locally, but this is not what it does,
| right?
|
| > Wayback one or more url to Internet Archive and archive.today
|
| > Wayback url to Internet Archive or archive.today or IPFS
|
| Correct me if I'm wrong, but... does this forward links to
| those services rather than actually running the archiving
| process locally? In that case I would argue that it's not
| really what _self-hosted_ conveys.
| pastage wrote:
| > Chromium: Wayback uses a headless Chromium to capture web
| pages for archiving purposes. [0]
|
| I think it is a local spider. This quote is the only clear
| indication that it actually crawls pages itself; there are
| lots of implicit hints as well.
|
| [0] https://docs.wabarc.eu.org/installation/
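|
| If I read the docs right, basic usage is something like this
| (flags as I understand them, not verified):
|
|     # push a page to the Internet Archive and archive.today
|     wayback --ia --is https://example.com/post1
|
|     # additionally pin a copy to IPFS
|     wayback --ia --is --ip https://example.com/post1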
| bomewish wrote:
| Gotta agree that it's really darn confusing as to what this
| thing actually does. Seems like a pretty big project that some
| talented people have poured a lot of time into, but what am I
| getting into here?
|
| Does it (1) download URLs already archived on archive.org to
| my machine, or (2) crawl websites I choose from scratch and
| save them locally (or on archive.org??)
|
| Totally unclear. These are very different things. Want to
| install and try it but also a little wary given the opacity of
| explanation about the basic functionality and purpose of the
| tool!
| navigate8310 wrote:
| Why is this better than simply hitting a URL like
| archive.today/example.com/post1 with cURL?
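|
| i.e. something like this (endpoints from memory):
|
|     # fetch the newest archive.today snapshot of a page
|     curl -sL 'https://archive.today/newest/https://example.com/post1'
|
|     # or ask the Wayback Machine to save it
|     curl -s 'https://web.archive.org/save/https://example.com/post1'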
| GTP wrote:
| I think that if you use the IPFS option, you can pin the
| generated files so you have them locally.
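|
| e.g., assuming the tool gives you the CID of the upload,
| roughly:
|
|     ipfs pin add <cid>
|     # and later read it back from your local node
|     ipfs cat <cid> > page.html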
| anthk wrote:
| http://theoldnet.com
|
| For old or modern browsers without JS, such as DilloNG or
| Netsurf.
| kwhitefoot wrote:
| If I want to keep a copy of a web page I use the SingleFile
| Firefox add-on. It saves a copy of the rendered page and can
| embed images as data URLs.
___________________________________________________________________
(page generated 2023-04-16 23:02 UTC)