[HN Gopher] Wayback: Self-hosted archiving service integrated wi...
       ___________________________________________________________________
        
       Wayback: Self-hosted archiving service integrated with Internet
       Archive
        
       Author : thunderbong
       Score  : 217 points
       Date   : 2023-04-16 02:56 UTC (20 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | JamesAdir wrote:
        | Any recommendation for a tool that can crawl and download an
        | entire website completely and save it locally?
        
         | thunderbong wrote:
         | HTTrack Website Copier
         | 
         | https://www.httrack.com/
        
           | JamesAdir wrote:
            | Thanks, I tried it, but it has a problem with Unicode URLs
            | unless there's some setting I'm missing.
        
             | gala8y wrote:
              | Try adding +*{unicode} to the scan rules. Seriously, please
              | try it and let me know if it works.
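              | 
              | For reference, a minimal sketch of driving that from a
              | script (assuming httrack is on the PATH; whether the
              | +*{unicode} rule actually helps is exactly the part I'm
              | asking you to verify):
              | 
              |     import subprocess
              | 
              |     # Mirror the site into ./mirror, passing the
              |     # suggested scan rule as an extra filter argument.
              |     subprocess.run(
              |         ["httrack", "https://example.com/",
              |          "-O", "./mirror",
              |          "+*{unicode}"],
              |         check=True,
              |     )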
        
         | arboles wrote:
         | Browsertrix or Brozzler, which crawl using headless browsers
         | for accuracy.
        
       | [deleted]
        
       | 9dev wrote:
        | > Wayback is a tool that supports running as a command-line tool
        | and docker container, purpose to snapshot webpage to time
        | capsules.
        | 
        | > Supported Golang version: See .github/workflows/testing.yml
        | 
        | The summary is hilarious! I still don't have the slightest idea
        | what it does, why I should care, or what it's good for. What are
        | "time capsules", and what does "snapshot" mean in this context?
        
         | joshspankit wrote:
         | It seems to make the assumption that people already know how
         | the Internet Archive's "wayback" service works, or at least
          | what its purpose is.
         | 
         | The essence is that web pages change over time and can be taken
          | down or otherwise lost. A "snapshot" is a capture of a webpage
          | at a specific time, and I assume that a "time capsule" is some
          | type
         | of format that holds the snapshot as well as the extra
         | metadata. The result is something that can be used later to see
         | that website as it was.
        
         | squarefoot wrote:
          | Not the best explanation, for sure. It seems to be a tool that
          | can be used to offload and potentially decentralize some
          | archiving work from the Internet Archive, with some
          | privacy/anonymity added to help against censorship. About time,
          | I'd say; however, I'm not sure whether it can be used for bare
          | files as well: books, retrogaming, software, etc.
        
       | angelmm wrote:
        | I love all these kinds of projects, as I tend to be paranoid
        | about losing good online content.
       | 
        | It's also unclear to me how Wayback works. It seems more like an
       | API than a self-hosted service.
       | 
       | I'm currently using ArchiveBox [0], which provides a complete API
       | + UI.
       | 
       | - [0] https://archivebox.io/
        
         | rzzzt wrote:
         | Are you using all extractors when saving a page?
         | 
         | I tried ArchiveBox and Shiori, but neither stuck for some
          | reason. The latter is a bit more lightweight; it can save the
         | entire page as well as a Readability-based conversion:
         | https://github.com/go-shiori/shiori/
        
           | Linux-Fan wrote:
           | I am not angelmm, but another happy ArchiveBox user.
           | 
           | My choice of extractors is the following: Singlefile, PDF,
           | Screenshot, archive.org.
           | 
            | I found the largest issue with any website archiving tool to
            | be the discrepancy between what I see in my web browser and
           | what is saved. The most "reliable" way that still works today
           | for me seems to be the "Save Page WE" Firefox plugin.
           | 
           | I have a sidecar container running that checks for HTML files
           | appearing in a directory, triggers the archivebox save and
            | then overwrites the "singlefile" capture with the provided HTML
           | file. This way, I can trigger archiving by just using the
           | Save Page WE plugin and storing the resulting HTML file in
           | the directory.
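            | 
            | Roughly, the sidecar's loop is the moral equivalent of this
            | sketch (simplified, with assumed paths: it presumes each
            | saved page arrives with a small .url text file next to it,
            | and that the SingleFile output lives at
            | archive/<timestamp>/singlefile.html, which may differ
            | between ArchiveBox versions):
            | 
            |     import pathlib, shutil, subprocess, time
            | 
            |     # Where the Save Page WE files land:
            |     INBOX = pathlib.Path("/data/inbox")
            |     # ArchiveBox data directory:
            |     DATA = pathlib.Path("/data/archivebox")
            | 
            |     def newest_snapshot():
            |         # Snapshot folders are named by timestamp, so the
            |         # highest-sorting one is the snapshot just created.
            |         return max((DATA / "archive").iterdir(),
            |                    key=lambda p: p.name)
            | 
            |     while True:
            |         for page in INBOX.glob("*.html"):
            |             url = page.with_suffix(".url").read_text().strip()
            |             # Let ArchiveBox create the snapshot and run
            |             # its extractors ...
            |             subprocess.run(["archivebox", "add", url],
            |                            cwd=DATA, check=True)
            |             # ... then swap in the browser-rendered capture.
            |             dst = newest_snapshot() / "singlefile.html"
            |             shutil.copyfile(page, dst)
            |             page.unlink()
            |             page.with_suffix(".url").unlink()
            |         time.sleep(30)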
        
             | zerkten wrote:
              | Can ArchiveBox aggregate content "bookmarked" in different
             | places?
             | 
             | I want a tool that will pull saved items on Reddit,
             | favorite posts on HN, etc. in addition to bookmarks posted
             | to pinboard.in to a single place. In many ways, the share
             | functionality in iOS allows me to get all URLs into a
             | single place, but this doesn't help on desktop.
             | 
              | I know this isn't necessarily an easy task if APIs aren't
             | available. I'd be OK with a client component or browser
             | extension, if it was open source and self-hostable.
        
               | Linux-Fan wrote:
               | AFAIK it does not contain any function to support this
               | directly.
               | 
               | My primary way to interact with Archive Box for such
               | purposes is by calling it on the command line. Scripts
               | may be used to obtain the URLs of interest from any
               | source.
               | 
                | When I started with Archive Box I already had some
                | existing downloaded websites from ScrapBook and Save
                | Page WE. I used some hacky scripts to extract the URLs
                | from the respective pages and overwrite Archive Box's
                | downloaded copies with my original copies, so as to make
                | it work for pages that had been deleted in the meantime.
                | All my data sources were local/desktop, though.
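                | 
                | For example, something along these lines could pull
                | bookmarks from pinboard.in and feed them to ArchiveBox
                | (only a sketch; PINBOARD_TOKEN is a placeholder and the
                | endpoint/fields are from memory, but the same pattern
                | works for any source you can script URLs out of):
                | 
                |     import json, os, subprocess, urllib.request
                | 
                |     # Pinboard's posts/all endpoint returns every
                |     # bookmark as JSON.
                |     token = os.environ["PINBOARD_TOKEN"]
                |     api = ("https://api.pinboard.in/v1/posts/all"
                |            f"?auth_token={token}&format=json")
                | 
                |     with urllib.request.urlopen(api) as resp:
                |         posts = json.load(resp)
                | 
                |     for post in posts:
                |         # Feed each URL to ArchiveBox (run from its
                |         # data directory).
                |         subprocess.run(
                |             ["archivebox", "add", post["href"]],
                |             cwd="/data/archivebox", check=True,
                |         )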
        
           | Handprint4469 wrote:
           | ArchiveBox also saves a Readability version:
           | 
           | > Article Text: article.html/json Article text extraction
           | using Readability & Mercury [0]
           | 
           | [0]: https://archivebox.io/#output-formats
        
       | unintendedcons wrote:
       | For archiving, look into https://github.com/dosyago/DiskerNet
       | 
       | It's real next gen thinking on this topic.
       | 
        | As for the featured tool, wayback... If HN readers can't figure
        | out what it does after reading the docs, it's likely the thinking
        | behind it is equally unclear.
        
         | mellosouls wrote:
          | Does it actually record web archives (i.e. WARCs)? I couldn't
         | work out from a quick look at the repo whether it does that or
         | not, though it claims to make your archives shareable.
         | 
         | There's a long-existing web archiving ecosystem with
         | established formats for recording and publishing archives.
         | 
          | The linked repo could probably do with some clarity itself on
          | how it does or doesn't fit in with those standards.
        
         | uniqueuid wrote:
          | DiskerNet is certainly interesting because it records archives
         | as you browse.
         | 
         | But it's not open source and pretty limited regarding the use
         | case.
         | 
         | The thing is, archiving is a multi-faceted and hard problem
          | (e.g. video content, live streams, interactive sites), and will
         | remain so. A complicated task leads to complicated tools.
        
         | jsiepkes wrote:
         | DiskerNet looks cool but is apparently commercial software?
         | Even for personal use you need a license I think?
        
         | gala8y wrote:
          | Looking at the link you gave doesn't help much in seeing what
          | DiskerNet does or what it looks like, either.
         | 
          | Keeping it simple, I download pages as Markdown and add some
          | metadata (a few tags). When I want images or more, I use the
          | SingleFile extension. Add Recoll to the mix and that's all I
          | need.
         | 
         | https://github.com/deathau/markdownload
         | 
         | https://github.com/gildas-lormeau/SingleFile
         | 
         | https://www.lesbonscomptes.com/recoll/pages/index-recoll.htm...
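          | 
          | The metadata step is nothing fancy; roughly this kind of
          | thing, prepending front matter that Recoll then indexes along
          | with the page text (a sketch with made-up field names):
          | 
          |     import datetime, pathlib, sys
          | 
          |     def tag_markdown(path, tags):
          |         # Prepend a small front-matter block (tags + date) to
          |         # a page already saved as Markdown by MarkDownload.
          |         md = pathlib.Path(path)
          |         header = "\n".join([
          |             "---",
          |             "tags: [" + ", ".join(tags) + "]",
          |             "saved: " + datetime.date.today().isoformat(),
          |             "---",
          |             "",
          |         ])
          |         md.write_text(header + md.read_text())
          | 
          |     if __name__ == "__main__":
          |         tag_markdown(sys.argv[1], sys.argv[2:])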
        
           | kuschkufan wrote:
           | What about videos (embedded or not)?
        
             | gala8y wrote:
             | yt-dlp is good enough for most cases for me.
        
       | pdimitar wrote:
       | Only 3 matches of the word "local" in the README and they don't
       | seem to refer to any self-hosting whatsoever.
       | 
       | What does this tool do exactly? Send URLs to remote services for
       | them to do snapshots?
       | 
        | That README is a classic case of filter-bubble syndrome. People,
        | it can't physically hurt you to add 1-2 sentences saying what
        | _exactly_ your tool does!
        
         | pmontra wrote:
         | It also stores locally. One of the last bullet points is
         | 
         | > Supports storing archived files to disk for offline use
        
           | pdimitar wrote:
            | Ah, I see it now, thanks. IMO it should be put front and
            | center, at the start of the project's description.
        
         | polynox wrote:
          | I would not complain about a free open source tool just
          | because its authors don't do a good job of marketing it.
        
           | pdimitar wrote:
           | They can do whatever they want, obviously.
           | 
           | But since it's open source they're likely aiming for more
           | developer mindshare. To have that they should make their
           | message crystal clear.
        
           | swyx wrote:
            | But it would be nice if they spent a bit of time documenting
            | it a bit more, the same as you would want from any coworker
            | at a big company. This isn't just marketing; it's good
            | developer hygiene.
        
         | x0x0 wrote:
          | The link that's kind of buried in the upper right explains it:
         | 
         | https://docs.wabarc.eu.org/
        
       | baq wrote:
       | Perhaps this should be integrated into browsers? Bonus points for
        | automatic push to the Wayback Machine in a privacy-preserving
       | manner...
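        | 
        | You can already get most of the way there from outside the
        | browser via the Internet Archive's Save Page Now endpoint; a
        | minimal sketch (the authenticated SPN2 API gives more control,
        | this is just the plain GET form):
        | 
        |     import urllib.request
        | 
        |     def push_to_wayback(url):
        |         # Ask the Save Page Now endpoint to capture the URL.
        |         req = urllib.request.Request(
        |             "https://web.archive.org/save/" + url,
        |             headers={"User-Agent": "snapshot-script"},
        |         )
        |         with urllib.request.urlopen(req) as resp:
        |             # When present, Content-Location points at the
        |             # freshly created snapshot.
        |             return resp.headers.get("Content-Location",
        |                                     resp.geturl())
        | 
        |     print(push_to_wayback("https://example.com/"))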
        
         | klysm wrote:
         | Seems hard to determine if a page has personal content. The
         | archiving should probably only be done on clients that don't
          | have credentials.
        
       | mattrighetti wrote:
        | When I initially saw the project I thought it was some kind of
        | WebArchive running locally, but this is not what it does, right?
       | 
       | > Wayback one or more url to Internet Archive and archive.today
       | 
       | > Wayback url to Internet Archive or archive.today or IPFS
       | 
        | Correct me if I'm wrong, but... does this forward links to those
        | services without actually running the archiving process locally?
        | In that case I would argue that it's not really what
        | _self-hosted_ conveys.
        
         | pastage wrote:
         | > Chromium: Wayback uses a headless Chromium to capture web
         | pages for archiving purposes. [0]
         | 
          | I think it is a local spider. This quote is the only clear
          | indication that it actually does the crawling itself. There
          | are lots of implicit hints as well.
         | 
         | [0] https://docs.wabarc.eu.org/installation/
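          | 
          | In other words, under the hood it is presumably doing
          | something in the spirit of this (not wayback's actual code
          | path, just the headless-capture idea using Chromium's own CLI
          | flags):
          | 
          |     import subprocess
          | 
          |     # Binary name varies: chromium, chromium-browser,
          |     # google-chrome ...
          |     BROWSER = "chromium"
          |     url = "https://example.com/"
          | 
          |     # Dump the fully rendered DOM of the page.
          |     dom = subprocess.run(
          |         [BROWSER, "--headless", "--disable-gpu",
          |          "--dump-dom", url],
          |         capture_output=True, text=True, check=True,
          |     ).stdout
          |     with open("snapshot.html", "w") as f:
          |         f.write(dom)
          | 
          |     # Also keep a PDF rendition of the same page.
          |     subprocess.run(
          |         [BROWSER, "--headless", "--disable-gpu",
          |          "--print-to-pdf=snapshot.pdf", url],
          |         check=True,
          |     )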
        
         | bomewish wrote:
         | Gotta agree that it's really darn confusing as to what this
         | thing actually does. Seems like a pretty big project that some
         | talented people have poured a lot of time into, but what am I
         | getting into here?
         | 
          | Does it (1) locally download URLs already archived on
          | archive.org, or (2) spider websites I choose from scratch and
          | save them locally (or on archive.org)?
         | 
          | Totally unclear. These are very different things. I want to
          | install and try it, but I'm also a little wary given the opaque
          | explanation of the tool's basic functionality and purpose!
        
           | navigate8310 wrote:
            | Why is this better than simply making a cURL request such as
            | archive.today/example.com/post1?
        
         | GTP wrote:
         | I think that if you use the IPFS option, you can pin the
         | generated files so you have them locally.
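          | 
          | i.e. once the IPFS step hands you a CID, keeping it around
          | locally is a one-liner against your own node (a sketch; the
          | CID below is a placeholder and it assumes a local ipfs
          | daemon):
          | 
          |     import subprocess
          | 
          |     # Placeholder: the CID reported for the archived content.
          |     cid = "QmYourCidHere"
          |     # Pin it on the local IPFS node so it stays available
          |     # even if remote nodes drop it.
          |     subprocess.run(["ipfs", "pin", "add", cid], check=True)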
        
       | anthk wrote:
       | http://theoldnet.com
       | 
       | For old or modern browsers without JS, such as DilloNG or
       | Netsurf.
        
       | kwhitefoot wrote:
       | If I want to keep a copy of a web page I use the SingleFile
       | Firefox add-on. It saves a copy of the rendered page and can
       | embed images as data URLs.
        
       ___________________________________________________________________
       (page generated 2023-04-16 23:02 UTC)