[HN Gopher] Internet Archive as a default host-of-record for sta...
       ___________________________________________________________________
        
       Internet Archive as a default host-of-record for startups
        
       Author : bpierre
       Score  : 255 points
       Date   : 2021-12-21 16:51 UTC (6 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | mwattsun wrote:
       | I think it's interesting to think about what we have lost because
       | we couldn't keep everything from 100 years ago and what society
       | 100 years from now will be grateful we preserved.
       | 
       | Off the top of my head, we lost a lot of common wisdom in dealing
       | with the flu pandemic of 1918 because personal letters and most
       | newspapers were not preserved. I think 100 years from now they
       | might wish we had preserved more from marginal and/or world
       | communities. What folk wisdom is being lost? Perhaps we need to
       | expand our definition of what is worth saving.
        
         | bananamerica wrote:
         | Hard to know what will be of interest for future historians.
         | Some things in which we place great value can be considered
         | irrelevant, while some of our junk can become historical gold.
        
           | boarnoah wrote:
           | The mundane of today is very insightful for tomorrow's
           | historians.
           | 
           | It's fascinating, when you start looking into any historical
           | time period (you wouldn't even need to go far back), how many
           | of the details are educated guesses, since no one chose to
           | record the mundane in detail or the records failed to survive
           | over time.
        
           | mwattsun wrote:
           | > some of our junk can become historical gold
           | 
           | That's what interests me. For example, there's a cool
           | repository of 12 step speaker meeting talks hosted in Iceland
           | [1] and frankly, some of the talks are junk, but there's a
           | lot of wisdom. What I find interesting is how it showcases
           | how ordinary citizens talk to each other. The words they use,
           | the accents, the little gems of folk wisdom contained, along
           | with some uncommon stories.
           | 
           | This will be valuable 100 years from now if, for example, you
           | want to build a virtual world based in the mid to late 20th
           | century and you want to get the accents correct. What phrases
           | did people use? What were some common misconceptions? Maybe
           | 100 years from now addiction will no longer be a problem. If
           | my virtual world is to be accurate I need to know what it was
           | like for ordinary people when it was a problem. Etc...
           | 
           | [1] https://xa-speakers.org/
        
             | bananamerica wrote:
             | That's a great example, thanks. Funny enough, I got that
             | insight from _Bill & Ted's Excellent Adventure_. In the
             | end, the most important thing in the future was some 80s
             | rock song.
        
         | lelandfe wrote:
         | An excellent place to start is talking to your parents and
         | grandparents and recording their history and stories online.
        
           | mwattsun wrote:
           | That's a good suggestion I've been following
        
       | AlexanderTheGr8 wrote:
       | What compression does IA use to store websites? Using a 2x better
       | compression will allow them to store 2x more websites/content.
       | 
       | I am doing some compression research, and would love to help IA
       | in any way I can. There are some amazing SOTA compression
       | algorithms available now.
       | 
       | And if IA ignores images/video, and focuses only on text, they
       | can store an insane amount of websites at a very low cost.
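
The capacity arithmetic in this comment can be sketched in a few lines. This is a minimal illustration, not a statement about what format or codec IA actually uses: the sample page, the `pages_that_fit` helper, and the 1 GB budget are all made up for the example, and `zlib` and `lzma` are simply the two codecs that ship with Python.

```python
import lzma
import zlib


def compression_ratio(raw: bytes, compressed: bytes) -> float:
    """Raw size divided by compressed size (higher is better)."""
    return len(raw) / len(compressed)


def pages_that_fit(budget_bytes: int, raw_page_bytes: int, ratio: float) -> int:
    """How many pages of a given raw size fit in a fixed storage budget."""
    return int(budget_bytes * ratio // raw_page_bytes)


# Hypothetical sample page: repetitive markup, like much real-world HTML.
page = b"".join(
    b"<tr><td>comment %d</td><td>some archived text</td></tr>\n" % i
    for i in range(2000)
)

zlib_blob = zlib.compress(page, 9)
lzma_blob = lzma.compress(page)

# Lossless: both codecs must give back the exact original bytes.
assert zlib.decompress(zlib_blob) == page
assert lzma.decompress(lzma_blob) == page

for name, blob in (("zlib", zlib_blob), ("lzma", lzma_blob)):
    ratio = compression_ratio(page, blob)
    fit = pages_that_fit(10**9, len(page), ratio)
    print(f"{name}: ratio {ratio:.1f}x, about {fit} such pages per GB")
```

Doubling the ratio doubles the result of `pages_that_fit` for the same budget, which is the "2x better compression, 2x more content" point; real gains depend heavily on the content mix, since already-compressed images and video shrink very little.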
        
       | midoBB wrote:
       | hhh
        
       | tonymet wrote:
       | Is it really worth archiving ?
        
         | maxbond wrote:
         | It's difficult to know beforehand! A startup might be a total
         | flop that delivers nothing of value to us now, but it might be
         | a valuable datapoint for future historians to understand how
         | startup culture changed and evolved. Or it may be interesting
         | to future founders - I've seen some startups with perfectly
         | fine ideas fail, and then a few years later, someone succeeds
         | doing something very similar.
         | 
         | This may not be the best example, but while I'm sure the
         | Rosetta Stone, being a treaty, likely would have seemed like
         | something worth preserving, would anyone have imagined that it
         | would be the pivotal document in understanding ancient
         | Egyptian? That it would be one of the most important documents
         | of all time?
        
       | riobard wrote:
       | Correct me if I'm wrong, but isn't this problem the ideal use
       | case for projects like IPFS? Anyone interested to preserve the
       | content can join as a node to balance the load, right? And if so,
       | why don't we see widespread adoption?
        
         | jayd16 wrote:
         | IPFS does the opposite, right? It doesn't guarantee the archive
         | is available, which is what Carmack is asking for. Incentives
         | to scale bandwidth with need already exist as long as you have
         | the data at all.
         | 
         | That is to say, IPFS doesn't help if the desire blooms after
         | the nodes dry up. Things could still be lost.
        
           | Ericson2314 wrote:
           | The point of IPFS is not to keep the data archived, but to
           | allow users to not care who does the archiving.
           | 
           | Concretely, this would be to skip the "many people on
           | encountering a dead URL don't bother to try the internet
           | archive" problem.
        
         | ricardobayes wrote:
         | There are many projects that go open source when they fail.
         | The extra step here is IA would become the A record for the
         | project (at least temporarily?)
        
         | Ericson2314 wrote:
         | I was involved with planning
         | https://nlnet.nl/project/SoftwareHeritage-P2P/ for just this
         | reason --- hopefully we will finally be able to start work on
         | it sometime not too far off.
         | 
         | Indeed the _real_ challenge of archival is not losing the
         | stuff, but making sure that people can still find the stuff.
         | "Orphaned" information that no one knows exists, or is
         | bothering to interact with, isn't that valuable compared to
         | resources that are actively being used and still "live" in the
         | culture.
         | 
         | Of course, the archive can never serve the same amount of
         | bandwidth, but the goal is a) interested parties can mirror the
         | stuff they care about in a higher bandwidth / item way after
         | some huge disruption b) random viewers never notice something
         | going down, nor who is serving the info, but just a temporary
         | drop in connection quality.
         | 
         | Ultimately, location-based addressing is a stupid way to run
         | society, needlessly fragile by baking in the very property claims
         | (IPs, DNS, etc.) that are incidental to the task at hand.
         | Content-based addressing, with location based hints to avoid
         | trying to solve really hard problems all at once, is the only
         | way to make culture more robust.
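
A minimal sketch of the idea in this comment, assuming a plain SHA-256 hex digest as the content address (this is not IPFS's actual CID format) and plain dictionaries standing in for nodes. The names `content_id`, `fetch`, and the three example nodes are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Optional


def content_id(data: bytes) -> str:
    """Address derived from the bytes themselves, not from who hosts them."""
    return hashlib.sha256(data).hexdigest()


def fetch(cid: str, hints: Iterable[Dict[str, bytes]]) -> Optional[bytes]:
    """Try each hinted node in turn; accept the first copy whose hash matches."""
    for node in hints:
        data = node.get(cid)
        if data is not None and content_id(data) == cid:
            return data
    return None


page = b"<html>a niche site worth keeping</html>"
cid = content_id(page)

original_host: Dict[str, bytes] = {}                      # went offline, holds nothing
tampered_mirror = {cid: b"<html>something else</html>"}   # wrong bytes under the right key
archive_node = {cid: page}                                # stand-in for an archive

result = fetch(cid, [original_host, tampered_mirror, archive_node])
print(result == page)
```

The locations are only hints: the reader does not need to trust or even know which node answered, because the address itself verifies the bytes. If no node holds the content, `fetch` returns `None`, which is the availability gap raised elsewhere in the thread.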
        
           | btown wrote:
           | The great thing about location-based addressing is that an
           | archive of the set of known locations is not subject to the
           | same ownership rules as the canonical live version of those
           | addresses. A document listing all Geocities URLs can be
           | placed in content-addressed storage without needing
           | geocities.com to be owned by the party that emplaces that
           | document. And a chain can be maintained such that people are
           | incentivized to remember that document into the far future.
           | Coupled with archival of the actual content, you bypass the
           | exclusivity of domain ownership.
           | 
           | Of course, ensuring that there's persistence of _attention_
           | as well is a tougher problem. But one only needs to look at
           | sites like https://reddit.com/r/tumblr to realize that there
           | is immense societal interest in "meme archaeology." Reducing
           | the barriers to entry to would-be archaeologists, giving them
           | a "chain" of breadcrumbs that lead to content, and building
           | communities that will socially reward people for their
           | archaeology work, is the best thing we can possibly do.
        
             | Ericson2314 wrote:
             | > The great thing about location-based addressing is that
             | an archive of the set of known locations is not subject to
             | the same ownership rules as the canonical live version of
             | those addresses.
             | 
             | Erm, to me this sounds like putting up with link rot as
             | a hack around bad IP law? There are already IP exceptions for
             | preservation. And if content-addressing was the norm,
             | geocities-type sites might bow to market pressure to not
             | "own" the content, but merely have some sort of
             | license for being the exclusive pinning service and running
             | the ads or whatever. This is like avoiding the problem
             | where the rent on your current apartment doesn't fall
             | as much as the market writ large because your landlord
             | knows moving is not free.
        
         | thesausageking wrote:
         | IPFS only does addressability, it doesn't provide storage. You
         | could use a decentralized storage network like Arweave,
         | Filecoin, or Sia.
         | 
         | https://www.arweave.org/
         | 
         | https://www.filecoin.com/
         | 
         | http://sia.tech/
        
           | cle wrote:
           | Or "centralized" ones like Fleek, Textile, Pinata, etc.
           | 
           | https://fleek.co/hosting/
           | 
           | https://docs.textile.io/buckets/
           | 
           | https://www.pinata.cloud/
        
         | dannyobrien wrote:
         | It is -- and Brewster Kahle and the Archive have been thinking
         | about this for a long while (see this talk from him five years
         | ago: https://archive.org/details/LockingTheWebOpen_2016 ). The
         | model you can think of for this would be to have the Archive as
         | the "node of last resort" of content-addressable storage,
         | making sure there's always one node up with the content you
         | want.
         | 
         | The incentive challenges are making sure that the average
         | number of nodes is _more_ than one, because, as Brewster likes
         | to say,  "libraries burn; it's what they do", plus all the
         | traditional challenges of maintaining a commons at high levels
         | of resilience. Once you have data on a network like IPFS, we
         | can use a number of incentive models to make sure it stays
         | there, including charitable projects like the Archive,
         | government support (archives are traditionally state projects
         | -- if every country's archive was pinning this content, it
         | would be far more resilient), and decentralized incentive
         | frameworks like Filecoin.
         | 
         | (Disclosure: I work for the Filecoin Foundation; in our
         | decentralized preservation work, we've funded the Internet
         | Archive's work in this area, though I should emphasise that IA
         | works with a lot of different decentralizing technologies
         | through their https://getdweb.net/ community.)
        
       | nickdothutton wrote:
       | I wrote a little about this and the alternatives here, especially
       | for sites with user-generated content which may be impossible to
       | find/regenerate from elsewhere.
       | 
       | I call it "Beating the Samson Option", of pulling the temple down
       | upon your head.
       | 
       | https://blog.eutopian.io/beating-the-samson-option/
        
       | pjc50 wrote:
       | I'm old enough to remember when the host-of-record for failed
       | startups was fuckedcompany.com ...
       | 
       | I do wonder how many startups actually _want_ to be archived,
       | rather than just ditch everything with unseemly speed as soon as
       | they get acquishutdown.
        
         | monkeydust wrote:
         | There are hugely valuable learnings to be had in failure. Through
         | some coordination they could perhaps get compensation for
         | sharing although this goes against it all being open and free.
        
       | kodah wrote:
       | Personally, I think eternally archiving everything and infinitely
       | available public data has been not-so-great. If this was an
       | "archive with consent" sort of system, then sure. My response may
       | be better summarized as, "Does IA support robots.txt, and if not
       | why?"
        
         | rectang wrote:
         | Throughout human history, records have been forgotten,
         | rewritten, changed, mutated, degraded, eroded away to
         | nothingness. "The internet is forever" has always struck me as
         | inhumane. Make a mistake or expose a weakness on the internet
         | and it will always accompany you.
         | 
         | It turns out that the internet is not always forever. I find
         | that comforting.
        
           | treesknees wrote:
           | Depends on how you think of "forever". If you post an
           | embarrassing video and someone saves and reposts it with your
           | name attached, odds are that video isn't going to be around
           | in 200 years. But what about the next 10, 20 or 40 years? In
           | the context of your overall professional adult life, that's a
           | long time.
           | 
           | "The Internet is forever" isn't some natural law by which all
           | content abides as though it can never disappear. It's a
           | warning that you don't control the content once it's
           | accessible on the Internet.
        
           | symlinkk wrote:
           | Maybe humans should become more accommodating of past
           | mistakes.
        
             | _jal wrote:
             | Suggest that to them. I'm sure they'll get right on it.
        
           | oconnor663 wrote:
           | I think there's inhumanity of a kind on both sides of this
           | question. "Everything you've done will be forgotten, and no
           | one will remember your name" is the sort of thing the bad
           | guys say in movies. But that's what happens to most of us in
           | the end. I think it's natural not to want that.
        
         | garaetjjte wrote:
         | You're consenting by posting it on public internet in the first
         | place.
        
           | ghaff wrote:
           | Posting something on the public internet is not consent for
           | you to scrape it and post it on your own site forever.
           | 
           | And requiring an explicit opt-in would basically mean no IA.
           | 
           | To be clear, the IA is a positive, maybe even a great one.
           | But it skirts by because most people don't care. (They did as
           | you say post whatever on the public internet.) Add the facts
           | that they're a non-profit, aren't trying to monetize their
           | hosting, and will generally take things down if the owner asks.
           | 
           | Libraries and other archives have some very limited special
           | rights (which mostly relate to making physical backups of
           | physical books). But invoking "library" isn't some general
           | get out of jail free card with respect to copyright.
        
             | blackearl wrote:
             | People have recorded others without their consent for
             | millennia. Whether it's telling a story about what someone
             | said, a photo, or now a screenshot of a twitter post, that
             | is reality. You'll never be able to stop someone from
             | telling another that you said X.
        
             | _jal wrote:
             | > Posting something on the public internet is not consent
             | for you to scrape it and post it on your own site forever.
             | 
             | It effectively is. Your consent is not required, and people
             | are doing far worse than just keeping it available
             | (Clearview; there are also reports of people hoovering up
             | encrypted data to crack in the coming decades when we're
             | post-quantum).
             | 
             | This is no different than demanding people not keep track
             | of anything else, and attacking archive.org might make you
             | feel better, but that won't make anyone else stop.
        
             | pessimizer wrote:
             | It can be. The reason libraries and other archives have
             | special rights is because they fought for them against the
             | express wishes of people who sold paper. There are no
             | arguments made against archive.org that weren't also made
             | against libraries.
        
               | Mezzie wrote:
               | Thank you.
               | 
               | These rights are also under constant attack: It's normal
               | to charge libraries exorbitant prices for digital
               | materials compared to their analogue counterparts, for
               | example.
        
           | netrus wrote:
           | Do you think most people who post publicly on the internet
           | would agree if asked? I think most people would like to have
           | a choice to make old stuff disappear. If most people think
           | so, THAT should be the rule. You might not like that and
           | argue that there is no way to enforce it, but that does not
           | mean it is a good rule to assume consent.
        
             | ghaff wrote:
             | There are two things:
             | 
             | 1.) Most people won't opt-in because a significant majority
             | accept defaults and don't opt into most things.
             | 
             | 2.) For people like yourself probing a bit deeper, you
             | might well ask whether you really want to give up your
             | ability to decide you don't want something you thought was
             | so funny when you wrote it at 20 now that you're a
             | politician running for office or up for a political
             | appointment.
        
               | kodah wrote:
               | I mean even honoring the opt-out of robots.txt would be
               | fantastic. As another commenter pointed out they
               | willfully ignore it:
               | https://blog.archive.org/2017/04/17/robots-txt-meant-for-
               | sea...
               | 
               | It's fairly unethical.
        
             | BeFlatXIII wrote:
             | How about this? Archives will still exist regardless of
             | consent. Does that make them digital rapists?
        
         | FrostKiwi wrote:
         | Public data is... public. No one should be stopped from saving
         | a public page and nothing should stop Internet Archive, be it
         | robot or human. There is of course a need for removal of
         | archived content infringing on someone's rights, whatever that
         | might be, but "archive with consent" will fail at the goal of
         | preserving culture. I think it's worrying that some online
         | newspapers have enacted archive blockers, or that IA needs DMCA
         | exemptions just so companies can't DMCA anything with their
         | name on it. To preserve journalistic integrity and to save
         | culture, even if it collides with intellectual property rights,
         | "archive with consent" won't cut it.
        
           | orev wrote:
           | Content posted on a web site is NOT "public" (domain), it is
           | (in the US) automatically copyrighted to the author, unless
           | they specifically waive those rights. Just because you can
           | see it through a browser doesn't in any way mean you can make
           | it yours and do what you want with it.
        
             | ImprobableTruth wrote:
             | We store books in public libraries even if they aren't
             | public domain and authors can do nothing to prevent them
             | from doing so.
        
               | ghaff wrote:
               | First sale doctrine in the US. If I buy a physical book,
               | I can give it, loan it, throw it in the trash, etc. This
               | doesn't apply to making electronic copies--see what
               | happened with Google Books for example.
        
               | rat9988 wrote:
               | If you buy the book, you have an authorization to do so.
               | But you can't put copies of it.
        
               | Mezzie wrote:
               | You do if you're an archive, actually.
        
             | FrostKiwi wrote:
             | Absolutely true. "unless they specifically waive those
             | rights" - if archiving entailed contacting the owner with a
             | legal archive request, we would have archived basically
             | nothing. Luckily there are exceptions for Internet Archive
             | in place. My point is, if "by consent" was the requirement
             | to archive information, we would have archived nothing.
        
             | Mezzie wrote:
             | It doesn't matter.
             | 
             | Archives are exempt from being forbidden to create copies
             | due to copyright infringement. The Library of Congress can
             | make all the copies it wants, it just can't SELL them.
             | 
             | Now, there is a question whether a private company should
             | legally be able to BE an archive of record, but as of now
             | there's no legal reason they can't be, I believe. So it's
             | legal.
        
         | 1MachineElf wrote:
         | There is some public discussion about why IA does not strictly
         | adhere to robots.txt:
         | 
         | https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
        
           | kodah wrote:
           | They're basically saying they're choosing to ignore a web
           | convention that explicitly states that people don't want
           | their websites archived or searchable because they want them
           | to be. Sounds pretty unethical to me.
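
For readers unfamiliar with the convention being argued about, here is a rough sketch of what "honoring robots.txt" means for a crawler, using Python's standard `urllib.robotparser`. The robots.txt content, the `may_fetch` helper, and the `SomeBot` name are hypothetical, and `ia_archiver` is used only as an illustrative user-agent token; nothing here describes IA's actual crawler logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask one archiving crawler to stay out entirely,
# and ask everyone else to skip /private/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """What a crawler that chooses to honor robots.txt would decide."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch("ia_archiver", "https://example.com/index.html"))
print(may_fetch("SomeBot", "https://example.com/index.html"))
print(may_fetch("SomeBot", "https://example.com/private/notes.html"))
```

The file is purely advisory: the check runs on the crawler's side, so a crawler that skips the `may_fetch` step is not technically blocked by anything, which is why the disagreement in this subthread is about ethics rather than enforcement.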
        
         | Ericson2314 wrote:
         | There are clearly some things we, and the author, want to
         | persist, and yet the internet fails to do so.
         | 
         | Trying to muddle in privacy concerns with the accumulation of
         | public knowledge undermines the whole concept of a shared
         | society, or there being even the potential of accumulating
         | "progress" in the first place.
         | 
         | Also, historically speaking, people with means used to save
         | their letters for posterity, which proved to be a very valuable
         | resource for future academics, so the idea of, what, deleting
         | all your proton mails and signal messages as encouraged is
         | arguably overshooting the return to some pre-internet norm.
        
           | kodah wrote:
           | > Also, historically speaking, people with means used to save
           | their letters for posterity, which proved to be a very
           | valuable resource for future academics
           | 
           | Your example is an example of _choice_ or consent. They also
           | had the option to burn their hand written books and scrolls
           | down periodically. Systems like IA take that choice away.
        
             | Ericson2314 wrote:
             | There is no reason to do personal communication with
             | staticish websites. Don't foist the problems of social
             | media onto the Internet Archive.
        
               | kodah wrote:
               | I'm not sure I understand. robots.txt was a matter of
               | consent and was standardized well over twenty years ago.
               | The IA willfully ignores it because they believe it
               | interferes with their mission. This was long before
               | social media, when static websites were more dominant
               | than dynamic ones.
               | 
               | Source: https://blog.archive.org/2017/04/17/robots-txt-
               | meant-for-sea...
        
           | luckylion wrote:
           | "You shouldn't have privacy because it undermines the concept
           | of a shared society, and also future historians might find
           | your life interesting."
           | 
           | Thanks.
        
             | Ericson2314 wrote:
             | Ridiculous strawman.
        
       | thrdbndndn wrote:
       | I genuinely didn't understand the point he's trying to make.
       | 
       | Could someone ELI5?
        
       | 1vuio0pswjnm7 wrote:
       | Why is there only one IA?
       | 
       | Why is IA not globally distributed, like a CDN?
       | 
       | I use IA for "problem" websites, e.g., ones that rely on SNI,
       | i.e., ones hosted at certain CDNs. I simply add these sites
       | to a list and the local proxy does the rest.
       | http-request set-uri
       | https://web.archive.org/web/1if_/http://%[req.hdr(host)]%[pathq]
       | if { hdr(host) -m str -f list }
       | 
       | IA "hosts" an enormous number of sites without the need for SNI
       | (plaintext hostnames sent over the wire).
       | 
       | EDIT: @sebow the way they (re)format the HTML is less friendly to
       | the text-only browser I use.
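
The proxy rule quoted in this comment is essentially a string rewrite, which can be sketched outside the proxy as follows. This is an illustration only: the URL prefix, the `1` timestamp, and the `if_` modifier are copied from the rule above, while the function names and the example host list are hypothetical.

```python
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(host: str, path_and_query: str = "/",
                timestamp: str = "1", modifier: str = "if_") -> str:
    """Mirror the rewrite above: prefix + timestamp + modifier + original URL."""
    return f"{WAYBACK_PREFIX}{timestamp}{modifier}/http://{host}{path_and_query}"


def rewrite(host: str, path_and_query: str, problem_hosts: set) -> str:
    """Send listed hosts to the Wayback Machine; leave others untouched."""
    if host in problem_hosts:
        return wayback_url(host, path_and_query)
    return f"http://{host}{path_and_query}"


problem_hosts = {"example.com"}  # hypothetical contents of the proxy's list file
print(rewrite("example.com", "/docs?page=2", problem_hosts))
print(rewrite("example.org", "/", problem_hosts))
```

Only hosts on the list are redirected, matching the `if { hdr(host) -m str -f list }` condition in the original rule; everything else passes through unchanged.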
        
         | thrdbndndn wrote:
         | > 1if_
         | 
         | This is a neat shortcut to simply get the very first archived
         | version! I often have to go to /*/ and manually click on one of
         | them, which is very tiring.
         | 
         | Is there one to get the latest?
        
         | sebow wrote:
        
         | amenghra wrote:
         | What's your issue with SNI/threat model? If you use a non-SNI
         | site, anyone can tell which site you are visiting since there's
         | only one domain on that IP.
        
           | 1vuio0pswjnm7 wrote:
           | There is a difference between making something "impossible"
           | and making something "easier". Performing reverse DNS
           | lookups, or otherwise trying to maintain a global table of
           | 1:1 domain:IP mappings and perform lookups in real-time, is
           | nowhere near as easy nor reliable as sniffing SNI. IME, it is
           | neither easy nor reliable, nor worth the effort. SNI is the
           | preferred method. SNI is easier. SNI is 100% reliable for
           | detecting what hostname the user is trying to access.
           | 
           | What is the point of so-called "DNS privacy/Private DNS" if
           | "anyone can tell which site you are visiting" simply by
           | observing IP addresses, without any need to see domain names?
           | 
           | If SNI (plaintext hostnames sent over the wire) is a non-
           | issue, then why are people working on encrypted Client Hello
           | in TLS 1.3?
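
The "plaintext hostnames sent over the wire" point can be checked locally without touching the network. The sketch below starts a TLS handshake against in-memory buffers using Python's standard `ssl` module and looks at the first bytes the client would send. It assumes a Python/OpenSSL build that does not use Encrypted Client Hello; the helper name and the `example.org` hostname are made up for the example.

```python
import ssl


def client_hello_bytes(hostname: str) -> bytes:
    """Start a TLS handshake in memory and return what the client would send first."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming = ssl.MemoryBIO()   # bytes from the (nonexistent) server
    outgoing = ssl.MemoryBIO()   # bytes the client wants to put on the wire
    tls = context.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the ClientHello is queued and the client now waits for a reply
    return outgoing.read()


hello = client_hello_bytes("example.org")
print(b"example.org" in hello)
```

Nothing has been encrypted at this stage of the handshake, so an on-path observer can read the requested hostname directly from these bytes, with no reverse DNS lookups or IP-to-domain tables needed.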
        
           | yellowapple wrote:
           | Seconding that curiosity. Pretty much every web server I've
           | built uses SNI (at least if it's hosting sites under multiple
           | domains), and the only "downside" of which I'm aware is the
           | lack of IE6 support.
        
       | zaps wrote:
       | I was with him up until "blockchain".
        
       | markjgraham wrote:
       | Hi,
       | 
       | I manage the Wayback Machine at the Internet Archive.
       | 
       | Very happy so many people here care about preserving, and making
       | available, our cultural heritage!
       | 
       | Please know a dedicated, and talented, team of engineers works
       | every day to do a better job of archiving more of the public Web,
       | and making it available via the Wayback Machine.
       | 
       | As noted the Internet Archive is experimenting with filecoin.io
       | and storj.io and is always open to suggestions about how we might
       | do our jobs better, and improve our service. We also host regular
       | meetups (and have hosted summits and a camp) related to the
       | Decentralized Web. See: https://blog.archive.org/tag/dweb/
       | 
       | The Internet Archive also offers archive-it.org, a subscription
       | service, for those who want a higher level of support and more
       | features.
       | 
       | We appreciate any support you can offer, financial and otherwise.
       | Please share any bug reports, feature suggestions and other
       | feedback with us via email to info@archive.org
       | 
       | Oh/and... checkout the new PDF Search feature we just launched at
       | the bottom of web.archive.org. More to come like that in 2022.
       | 
       | Finally, you might also find some of the things I wrote here of
       | interest: https://gijn.org/2021/05/05/tips-for-using-the-
       | internet-arch...
        
       | a-dub wrote:
       | it sounds like he wants to be able to archive and respinup/run
       | things like mmorpg game servers as easily as static content can
       | be archived and served today.
       | 
       | that would be a huge and expensive paradigm shift for internet
       | service backends that traditionally have been only designed to be
       | run by one entity and typically are a mix of custom, open source
       | and proprietary software that is run in a specific way.
       | 
       | i suppose it could be done, but there hasn't been any reason to
       | make that investment. service backends also tend to be more
       | "living software" where part of the system is the team that
       | continuously builds, updates and operates it.
       | 
       | basically, it would look something like java applets, but for
       | entire internet service backends. one step and all the services,
       | databases, everything would spin up and start serving. that would
       | be great but is probably a ways out.
        
         | jayd16 wrote:
         | Well there's other static things like FAQs, news feeds, even
         | hosted game content that a game might rely on. It's not just
         | dynamic services. I think Carmack is referring to that.
        
           | a-dub wrote:
           | ..."something like this could combine with a blockchain style
           | technology to make internet applications that could outlive
           | companies. A niche multiuser game that couldn't meet company
           | revenue goals could still be "fed" by anyone that wanted to
           | push resources at it, since the "
           | 
           | that's not static assets, that's a multiuser game server.
           | 
           | i think he's envisioning entire internet service backends
           | that can be packaged up like java applets and re-run on
           | demand paired with some kind of decentralized serving
           | infrastructure that any user can insert coins and resurrect a
           | sophisticated web service from the past.
           | 
           | more likely i suspect we'll see more efforts by hobbyists to
           | resurrect these things and more releases of backends from
           | failed projects into the public domain.
           | 
           | with so much physical gear that requires service backends
           | being made today, we may even see regulation that requires
           | release of the source for a service when a service is shut
           | down. crazy to think that if the company who made your car or
           | tractor fails, that your perfectly good car or tractor could
           | cease function when they shut down the service backend.
        
       | 908087 wrote:
        
       | msla wrote:
       | I'd love it if the Wayback Machine were less touchy/more
       | reliable.
       | 
       | The failure mode I see very often is that the frontend apparently
       | doesn't know what the backend's doing: The part which ingests
       | URLs and tells you what URLs have been archived does not know
       | what archives the backend has, so it will tell you a page has
       | been archived and give you a link to the archive, but when you
       | click the link, it tells you it does _not_ have the page
       | archived, oh, look, it exists online, would you like to archive
       | it now? Archive it again, and it will tell you that you can only
       | archive a page once every 45 minutes. If you're a weird little
       | obsessive like myself, you go through this process a half-dozen
       | times for one page before it acknowledges that, yes, it does have
       | the page archived ( _once_ , mind you) and you can actually see
       | it.
       | 
       | While I'm filing bitch reports...
       | 
       | The Wayback Machine apparently loves setting cookies. It will set
       | cookies until it has exceeded its own ability to _accept_
       | cookies, at which point it will give you a blank page and you
       | have to look in the developer console to figure out that it sent
       | you a "too many cookies" error in the response header. I've had
       | to force my browsers to not accept any cookies from the Internet
       | Archive to fix this.
        
       | alberth wrote:
       | >>" I wonder if there could be a world where the IA acts as a
       | default host-of-record for startups, with a super-easy CDN
       | relationship such that the content"
       | 
       | Doesn't IA already partner with Cloudflare to do exactly what
       | Carmack is suggesting?
       | 
       | https://blog.cloudflare.com/cloudflares-always-online-and-th...
        
         | magila wrote:
         | You still need to provide a separate origin server (i.e. host-
         | of-record) when using Always Online. AO is designed to use IA
         | as a backstop when your origin server goes down.
        
       | kderbyma wrote:
       | I love this idea. I am working on a mini project for a
       | decentralized game which can be self hosted or snapshots - so you
       | can share / modify a portion of it and provide a custom
       | experience
        
       | dleslie wrote:
       | There's nothing about his suggestion that requires use of a
       | Blockchain. IA could certify authenticity easily without it.
        
       | simonw wrote:
       | The feature I most want from the Internet Archive is the ability
       | to donate them an old domain name and enough cash to renew it for
       | the next hundred years such that they can keep an archived
       | version of a site available (without breaking any incoming links)
       | for a very long time.
       | 
       | They would also need to be able to handle legal administration
       | costs of things like DMCA take-down notices, but I assume they
       | already have to deal with that for the rest of the archive so
       | hopefully that's not an extra complexity for them.
        
         | EGreg wrote:
         | Why not use MaidSAFE or IPFS for that?
        
           | hluska wrote:
           | That doesn't solve the 'everyone you know is dead' problem.
        
           | frakkingcylons wrote:
           | With regards to IPFS, you'd need to find a pinning service
           | that offers long-term contracts. Moreover, I have more
           | confidence in the Internet Archive in being around in 10, 20,
           | 30 years compared to any existing pinning service. That's not
           | meant to be a slight against said pinning services, just that
           | the IA is well established in their role.
        
             | EGreg wrote:
             | Check out Freenet
             | 
             | They drop stuff that has the least amount of accessing, so
             | if you want to keep something online you have to pay
             | someone to keep accessing it. Makes sense, but I'd argue
             | that it should be a market, meaning the price should go up
             | as more people try to access stuff, but anyone can start to
             | seed popular content and collect revenues also for hosting
             | it (this is better than wasting electricity on accessing
             | stuff or doing proof of work).
             | 
             | But isn't FileCoin exactly that for IPFS?
             | 
             | MaidSAFE goes a step further and has nodes rebalance
             | autonomously and earn the most safecoin, as something gets
             | more popular it gets seeded more.
        
         | Ericson2314 wrote:
         | A far cheaper solution is a browser extension that looks up DNS
         | differently based on the age of the link.
         | 
         | It wouldn't be hard to maintain a hand-crafted database of when
         | domains are reused for something completely different, or even
         | when the same conceptual website has breakages, and use that to
         | choose between Internet Archive or live web accordingly. When
         | one is browsing from an internet archive page, the date is
         | known, when someone is browsing from a live website, heuristics
         | can be used, along with "bisecting" dates when the link is
         | dead.
         | 
         | Ultimately we want more content addressing to avoid this
         | problem entirely (see below), or DNS -> PubKey, PubKey ->
         | latest content, with some law that the pubkeys shall not be
         | reused for unrelated things, versus DNS, which is mere ephemeral
         | Huffman encoding. So see below for the stuff on IPFS. But the
         | trick above is a good stop-gap, and indeed the database itself
         | used to back the extension could be on IPFS.
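         | 
         | A rough sketch of that lookup, in Python rather than as an actual
         | browser extension. The DOMAIN_CUTOFFS table, its example entry,
         | and the resolve() helper are all made up for illustration; the
         | only real thing is the web.archive.org/web/<timestamp>/<url> link
         | form seen elsewhere in this thread. Guessing a link's age
         | (heuristics, "bisecting") is left out.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical hand-crafted database: domain -> date after which the domain
# stopped serving the original site (reused for something else, or broken).
DOMAIN_CUTOFFS = {
    "example-startup.test": date(2019, 6, 1),
}


def resolve(url: str, link_date: Optional[date]) -> str:
    """Return the URL to load: the live page or a Wayback Machine snapshot."""
    host = urlsplit(url).hostname or ""
    cutoff = DOMAIN_CUTOFFS.get(host)
    # Unknown link age, unknown domain, or a link written after the
    # changeover: assume the author meant whatever is live now.
    if link_date is None or cutoff is None or link_date >= cutoff:
        return url
    # Otherwise point at a capture near the date the link was written.
    timestamp = link_date.strftime("%Y%m%d") + "000000"
    return f"https://web.archive.org/web/{timestamp}/{url}"


old_link = resolve("http://example-startup.test/pricing", date(2015, 3, 2))
new_link = resolve("http://example-startup.test/pricing", date(2020, 1, 1))
```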
        
           | zamadatix wrote:
           | Trying to figure out the age of links referenced against a
           | hand crafted database (or trying to figure out if the current
           | version is "too different" based on age automatically) using
           | a browser extension only serves to create an unreliable
           | solution for a few using the extension. Cheaper sure but it's
           | also doing a whole lot less.
           | 
           | Alternative content addressed systems may also work better
           | for finding the content than someone hosting your DNS records
           | for a long time but the bulk of the problem space is in
           | guaranteeing active hosting in a way viewable to viewers of
           | the age will be available for many years not addressing the
           | content. On top of hosting and addressing the Internet
           | Archive offers the ability to view old content on modern
           | browsers even if modern browsers have 0 support for such
           | content anymore (or if browsers ever had support at all
           | even). Forward compatibility isn't something solved by a
           | protocol.
        
             | Ericson2314 wrote:
             | > using a browser extension only serves to create an
             | unreliable solution for a few using the extension. Cheaper
             | sure but it's also doing a whole lot less.
             | 
             | This is rather pessimistic thinking. The same money that
             | goes into buying up domains could go into lobbying browsers
             | to add this functionality by default.
             | 
             | > bulk of the problem space is in guaranteeing active
             | hosting in a way viewable to viewers of the age will be
             | available for many years not addressing the content.
             | 
             | You're moving the goal posts. I am not saying "IPFS means
             | we don't need the internet archive". We absolutely do need
             | the internet archive. Content addressing helps by making
             | archiving transparent, so the archival copy is not worse
             | than the original.
             | 
             | Fundamentally, consumers, producers, or archivists may be the
             | party most interested in the continued existence of some
             | information at different moments in the lifespan of that
             | information. Location-based addressing forces the producers
             | to shoulder the burden of hosting, but content-based
             | addressing allows the work to be distributed among those 3
             | however we see fit. Of course the burden must still be
             | borne! That doesn't mean the flexibility isn't extremely
             | useful.
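             | 
             | A toy model of that flexibility, not IPFS's actual protocol
             | or CID format: a plain SHA-256 hex digest stands in for the
             | content address, and Host and fetch() are hypothetical
             | names. The point is only that any party can hold the bytes,
             | and a reader can check a copy against the address.

```python
import hashlib
from typing import Dict, Iterable, Optional


def address(content: bytes) -> str:
    """Content address: here simply the SHA-256 hex digest (an assumption)."""
    return hashlib.sha256(content).hexdigest()


class Host:
    """Anyone willing to store bytes: a producer, consumer, or archivist."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blobs: Dict[str, bytes] = {}

    def pin(self, content: bytes) -> str:
        key = address(content)
        self.blobs[key] = content
        return key


def fetch(key: str, hosts: Iterable[Host]) -> Optional[bytes]:
    """Return the content from the first host whose copy matches the address."""
    for host in hosts:
        content = host.blobs.get(key)
        # Re-hash before trusting: a copy from any host is as good as the
        # original as long as it hashes to the address that was asked for.
        if content is not None and address(content) == key:
            return content
    return None


producer, archivist = Host("producer"), Host("archivist")
key = producer.pin(b"<h1>my old homepage</h1>")
archivist.pin(b"<h1>my old homepage</h1>")
producer.blobs.clear()  # the producer stops hosting
page = fetch(key, [producer, archivist])  # still served by the archivist
```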
             | 
             | > On top of hosting and addressing the Internet Archive
             | offers the ability to view old content on modern browsers
             | even if modern browsers have 0 support for such content
             | anymore (or if browsers ever had support at all even).
             | Forward compatibility isn't something solved by a protocol.
             | 
             | Yeah that's great too, and again not something I am arguing
             | is not good, or not necessary.
        
           | ignitionmonkey wrote:
           | I made something very similar, though instead of age, it
           | enforces page authorship for links (using PGP signatures).
           | https://webverify.jahed.dev/
        
         | btrettel wrote:
         | A full-text search for the Wayback Machine would be my top
         | feature request. It's not uncommon to lose the URL of a site
         | and for active webpages to not have the URL of the old website.
         | Plus I'm sure there are many interesting archived webpages I
         | could find with a full-text search.
         | 
         | I understand they've tried this or things like it a few times
         | but they haven't ever kept the feature.
        
           | gregsadetsky wrote:
           | I imagine that the costs to run this would outweigh the
           | potential marketing benefits, but it'd be amazing to see
           | Algolia take this project on to benefit everyone.
           | 
           | The Wayback Machine's data is ~20PB..? What is approximately
           | the size of the indexable text (i.e. the text content of html
           | pages, sans tags)? And what would the index size be like,
           | approximately?
           | 
           | I imagine that creating (and maintaining, of course) the
           | index would be the most time-consuming part? Is it at all
           | possible to imagine hosting this index... somewhere... and
           | doing sqlite http range-like queries on it..?
           | 
           | Would it be enough to have an index consist of a list of
           | found words, and the related "document ids"? i.e. "apple" is
           | in doc ids 1000, 2000, 3000, "banana" is in doc ids 2000,
           | 4000, etc.?
           | 
           | And have separate docid -> archive.org url mapping?
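           | 
           | A minimal in-memory sketch of that layout, assuming simple
           | lowercase word splitting. The doc ids match the example
           | above; the texts and URLs are placeholders, and
           | build_index()/search() are illustrative names. A real index
           | over petabytes would need on-disk, compressed posting lists.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(docs):
    """Map each word to the set of doc ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in WORD.findall(text.lower()):
            index[word].add(doc_id)
    return index


def search(index, urls, query):
    """Return URLs of documents containing every word of the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    # Intersect the posting sets; a word missing from the index yields none.
    ids = set.intersection(*(index.get(w, set()) for w in words))
    return [urls[i] for i in sorted(ids)]


# Placeholder documents and a separate docid -> URL mapping.
docs = {
    1000: "apple pie",
    2000: "apple and banana",
    3000: "apple cider",
    4000: "banana bread",
}
urls = {doc_id: f"https://example.org/doc/{doc_id}" for doc_id in docs}
index = build_index(docs)
hits = search(index, urls, "apple banana")
```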
        
         | adventured wrote:
         | > The feature I most want from the Internet Archive
         | 
         | The feature I most want from IA is a streamlined system to
         | delete content they have archived on domains that I own,
         | including a proper privacy law compliance effort on their part.
         | They have intentionally made it a difficult, manual process to
         | get content removed. They operate as a de facto malicious
         | crawler.
         | 
         | They massively violate GDPR with how they operate and few seem
         | to care about that fact, including all the commenters on HN
         | (which universally give them a free pass on being malicious and
         | violating GDPR very aggressively).
         | 
         | When IA has to comply with laws like GDPR, that's the end of
         | IA.
        
           | pjc50 wrote:
           | > When IA has to comply with laws like GDPR, that's the end
           | of IA.
           | 
           | Will you be happy when you've burned down that library?
        
           | Mezzie wrote:
           | I can't speak to GDPR specifically because I'm not European,
           | but a fair amount of laws have leeway for preservation
           | purposes. (For example, section 108 of the Copyright Act in
           | the US functionally exempts archives from being punished for
           | copying provided they are doing so for preservation
           | purposes).
           | 
           | There are very good reasons that archives will not destroy or
           | alter information outside of very clear difficult and manual
           | processes.
           | 
           | And actually, looking at it, I don't think they're
           | necessarily in violation of GDPR [0].
           | 
           | Point 3 says: "Where personal data are processed for
           | archiving purposes in the public interest, Union or Member
           | State law may provide for derogations from the rights
           | referred to in Articles 15, 16, 18, 19, 20 and 21 subject to
           | the conditions and safeguards referred to in paragraph 1 of
           | this Article in so far as such rights are likely to render
           | impossible or seriously impair the achievement of the
           | specific purposes, and such derogations are necessary for the
           | fulfilment of those purposes."
           | 
           | According to GDPR, national law of EU parties overrules GDPR
           | when it comes to personal data being used in archival
           | context. I don't know every EU country's stance, but most of
           | the bigger economies would allow for this.
           | 
           | There is also a difference between deleting the data and
           | rendering it inaccessible to the public. Keeping something
           | under wraps is generally more 'acceptable', but active
           | destruction of the item (digital or not) and its provenance
           | is much more limited. Also there's a difference between
           | personally identifying data (covered by GDPR), your content
           | (which would be covered under copyright and not GDPR), and
           | connections people can make if that content is available (not
           | covered at all because it's not anybody else's issue if you
           | write something terrible and people keep recognizing you over
           | it so long as you did actually write it).
           | 
           | [0] https://gdpr-info.eu/art-89-gdpr/
        
           | msla wrote:
           | https://web.archive.org/web/20200813235643/http://slawsonand.
           | ..
           | 
           | > Article 3(2), a new feature of the GDPR, creates
           | extraterritorial jurisdiction over companies that have
           | nothing but an internet presence in the EU and offer goods or
           | services to EU residents[1]. While the GDPR requires these
           | companies[2] to follow its data processing rules, it leaves
           | the question of enforcement unanswered. Regulations that
           | cannot be enforced do little to protect the personal data of
           | EU citizens.
           | 
           | > This article discusses how U.S. law affects the enforcement
           | of Article 3(2). In reality, enforcing the GDPR on U.S.
           | companies may be almost impossible. First, the U.S. prohibits
           | enforcing of foreign-country fines. Thus, the EU enforcement
           | power of fines for noncompliance is negligible. Second,
           | enforcing the GDPR through the designated representative can
           | be easily circumvented. Finally, a private lawsuit brought by
           | in the EU may be impossible to enforce under U.S. law.
           | 
           | [snip]
           | 
           | > Currently, there is a hole in the GDPR wall that protects
           | European Union personal data. Even with extraterritorial
           | jurisdiction over U.S. companies with only an internet
           | presence in the EU, the GDPR gives little in the way of tools
           | to enforce it. Fines from supervisory authorities would be
           | stopped by the prohibition on enforcing foreign fines. The
           | company can evade enforcement through a representative simply
           | by not designating one. Finally, private actions may be
           | stalled on issues of personal jurisdiction. If a U.S. company
           | completely disregards the GDPR while targeting customers in
           | the EU, it can use the personal data of EU citizens without
           | much fear of the consequences. While the extraterritorial
           | jurisdiction created by Article 3(2) may have seemed like a
           | good way to solve the problem of foreign companies who do not
           | have a physical presence in the EU, it turns out to be
           | practically useless.
        
           | gojomo wrote:
           | Shouldn't it be hard to delete things from a library and
           | historical-archive?
           | 
           | If you had to choose between the GDPR, & an accurate
           | historical record, which would you prefer?
        
             | Mezzie wrote:
             | We won't. Pretty much the only way things leave archives of
             | record (which is what IA is trying to be) are through Acts
             | of God (if the archive burns down/all of IA's servers are
             | taken out in a mass alien EMP attack).
             | 
             | An author suggesting that the LoC remove their copy of a
             | book/other work (including digital works) because they want
             | to unpublish it would not fly.
             | 
             | The parent comment has an issue with anything on the Web
             | not hidden in some way being considered 'public' and
             | 'published', but that would be something that would require
             | international cooperation to hash out.
        
         | toomuchtodo wrote:
         | It costs the Internet Archive $2/GB to host content in
         | perpetuity. They have a tool, Archive-It, that will
         | periodically crawl your site for archival purposes if you are
         | not technical.
         | 
         | For my needs, I run a report monthly for the content I've
         | archived using my IA account to determine archived GBs, and
         | then donate the amount needed to cover those costs.
         | 
         | Consider reaching out to their patron services email address
         | with any questions.
         | 
         | Edit: $2/GB citation: https://help.archive.org/hc/en-
         | us/articles/360014755952-Arch...
        
           | tablespoon wrote:
           | > It costs the Internet Archive $2/GB to host content in
           | perpetuity.
           | 
           | Do you have a source/more info than that?
           | 
           | Let's say the internet archive is 100 PB [1], that's
           | 100,000,000 GB [2], and at that rate it comes out to $200
           | million [3] for the whole thing forever. That's a lot of
           | money, but also a lot less than I was expecting for something
           | like that.
           | 
           | [1] https://www.protocol.com/internet-archive-preserving-
           | future: "The web archive alone is about 45 petabytes -- 4,500
           | terabytes -- and the Internet Archive itself is about double
           | that size (the group has other collections, like a huge
           | database of educational films, music and even long-gone
           | software programs)."
           | 
           | [2] https://www.google.com/search?q=100+petabytes+to+gb: "100
           | petabyte = 1e+8 gigabytes"
           | 
           | [3] https://www.google.com/search?q=1e%2B8+*+%242: "1e+8 *
           | (US$ 2) = 200 million US$"
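           | 
           | The same arithmetic as a quick sanity check. It assumes
           | decimal units (1 PB = 1,000 TB = 1,000,000 GB) and takes the
           | one-time $2/GB figure at face value; perpetual_cost_usd is
           | just an illustrative name.

```python
GB_PER_TB = 1_000          # decimal units, as in the Google conversion
TB_PER_PB = 1_000
COST_PER_GB_USD = 2        # one-time figure quoted in the parent comment


def perpetual_cost_usd(petabytes):
    """One-time cost in US dollars to host `petabytes` at $2 per GB."""
    gigabytes = petabytes * TB_PER_PB * GB_PER_TB
    return gigabytes * COST_PER_GB_USD


whole_archive = perpetual_cost_usd(100)   # the ~100 PB estimate
web_archive = perpetual_cost_usd(45)      # the web archive alone
print(f"100 PB: ${whole_archive:,}")
print(f" 45 PB: ${web_archive:,}")
```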
        
             | majou wrote:
             | Re: [1] 45 PB = 45,000 TB
        
             | kaashif wrote:
             | I love that you have a citation for 2 * 100 million = 200
             | million. I've seen papers where the use of "=" sweeps a lot
             | of non-trivial equalities under the rug, but this is the
             | first time I've seen something go the other way this far.
             | 
             | I suppose the claim is rather shocking and warrants
             | citations - $200m to host the entire internet archive
             | forever? I don't blame you for the excessive citation.
        
           | rmbyrro wrote:
           | $2/GB in perpetuity is really cheap. They should write a
           | paper about how they did it, if they haven't already.
           | 
           | Edit: I'm assuming they can deliver reliability and
           | durability similar to modern cloud standards, like AWS S3.
        
             | toomuchtodo wrote:
             | Self built storage nodes and software for storage system,
             | with an expectation that storage costs continue to decline
             | per GB into the future.
        
           | ineedasername wrote:
           | Is that $/GB per year?
        
             | toomuchtodo wrote:
             | Last I read, that was one time [1]. Of course if you have
             | the means, consider a bit more per GB.
             | 
             | [1] https://help.archive.org/hc/en-
             | us/articles/360014755952-Arch...
        
           | Buttons840 wrote:
           | I'd pay twice that to host my static blog with IA. ;)
           | 
           | I guess archiving a static blog is already trivial for their
           | system, but I'd pay $100 a year to have IA host my static
           | blog. The overhead I would consider as a donation to a worthy
           | cause.
           | 
           | Then again, I can just give them $100 a year and find some
           | free static hosting, like GitHub Pages, and call it a day.
        
         | lepouet wrote:
         | I agree, i'm a passionate photographer and i could pay good
         | money to know that my pictures could be seen long time after my
         | death. Maybe startups exist that do this, but they will die, i
         | need something with enough critical mass that i can trust.
        
           | osigurdson wrote:
           | Perhaps have a look at Arweave (https://www.arweave.org).
        
             | boarnoah wrote:
             | That doesn't address the concern re: something needing
             | critical mass to increase its chance of survival over a
             | longer term.
             | 
             | Really the way I see it outside a few large banking firms,
             | it's kind of hard to be sure any provider of digital
             | services would be around in the 50+ year term for this kind
             | of public archive.
             | 
             | I hope the Internet Archive manages it.
             | 
             | EDIT: I do worry the IA has a bit of a lightning rod effect
             | with skirting issues re: legality of archiving content. IMO
             | it's no guarantee it survives any significant time span
             | either.
        
       | soheil wrote:
       | An interesting use case for IA is to go back to the early days of
       | when a startup first launched to see their pitch before they raised
       | real money and added tons of nonsensical bs to their homepage.
       | 
       | eg.
       | 
       | http://web.archive.org/web/20180901110658/https://www.snowfl...
       | 
       | http://web.archive.org/web/20140701061721/https://databricks...
        
       | ordinaryradical wrote:
       | Imagine someone building this for SaaS hosting--a perma-Heroku,
       | or something like it. That's actually a huge value-add. Suddenly
       | tinyStartupA doesn't need to convince largeCorpB that it's going
       | to be around forever. The service can exist in perpetuity
       | without the company.
       | 
       | Complex repercussions obviously around acquisition, IP, and other
       | business dimensions however. Maybe unworkable even. But I think
       | there's a world where this actually exists and lowers the barrier
       | to building business-critical software and selling to companies
       | that need a 50-year commitment to risk you.
        
         | jayd16 wrote:
         | I wonder if it's worth it for these platforms (like Heroku) to
         | simply add a donation portal. It's not as future proof as
         | something fully open, but it wouldn't need the design,
         | implementation and maintenance of a brand new fully open PAAS
         | ecosystem.
        
       | ineedasername wrote:
       | I don't understand why Carmack thinks blockchain should be a
       | component of this. Anyone care to elaborate on how that would
       | make this easier/better?
        
         | k4ch0w wrote:
         | I think he's referring to something like IPFS.
         | 
         | https://en.wikipedia.org/wiki/InterPlanetary_File_System
         | 
         | http://ipfs.io
         | 
         | You can put the storage costs on the nodes because storage at
         | archive.org's scale adds up, especially when it's run by
         | volunteers.
        
           | ineedasername wrote:
           | It looks like you could use IPFS to accomplish this without
           | using a blockchain.
        
             | [deleted]
        
             | zingplex wrote:
             | Isn't IPFS pretty closely tied with FileCoin?
        
         | ketzo wrote:
         | "Immutable and existing in perpetuity" are good qualities for
         | an archive service, and that's at least the _idea_ with a
         | blockchain.
        
           | thomasikzelf wrote:
           | It is interesting though, what happens if you put so much
           | data into a blockchain. I guess only a couple of nodes would
           | want to verify the validity of the chain (because you need
           | all the data to do it). And would those nodes really be more
           | likely to keep the data than the situation we are in now?
           | 
           | I guess after some time the nodes would agree on the hash and
           | throw away the data because it would cost too much to store.
        
         | Jonovono wrote:
         | .... He says it right after "to make internet applications that
         | could outlive companies". If it's on the blockchain it doesn't
         | matter if the company storing all of the archives shuts down,
         | the content would still exist, forever, as long as there is a
         | network running the chain. I suppose something like Torrent
         | could be used?
        
           | Mezzie wrote:
           | He has a point, but it's that private companies shouldn't be
           | archives of record.
           | 
           | I actually think using blockchain for things like ensuring
           | provenance is interesting, since in archives being able to
           | have a clean record of what happened to a piece is VERY
           | useful. It just won't earn a ton of money, so we'll need to
           | wait for the capitalism to burn off to see more not-for-
           | profit uses.
        
         | hbgl wrote:
         | I stopped reading when blockchain was mentioned.
        
           | Jonovono wrote:
           | Why?
        
             | asdfasjdhfauehf wrote:
             | because it shows a lack of understanding of the basics of
             | distributed computing. especially on top of the web we have
             | today (which was how the thread started "IA as a default
             | host-of-records" which implies said records must be
             | reachable by any tech illiterate lawyer today)
             | 
             | car analogy time: It is the same as reading a post about
             | "how to lift my car to do work in the garage", and then the
             | second paragraph starts with "using energy harvested from
             | my perpetual motion machine"
        
           | noizejoy wrote:
           | I sincerely hope that we're only witnessing
           | 
           | "Any sufficiently long Internet discussion will propose
           | blockchain as a solution."
           | 
           | rather than
           | 
           | "Blockchain is eating the world."
        
           | ineedasername wrote:
           | I thought it was otherwise a reasonable idea, but yes-- it
           | put me off a bit when he mentioned blockchain _without
           | further elaboration_.
           | 
           | I see blockchain as a technology that may develop useful
           | applications, but-- in terms of current day usage-- I'm
           | extremely skeptical when it's referenced in conjunction with
           | applications that might achieve the same goals without it.
        
         | gls2ro wrote:
         | I don't know why he suggests this, but while I would like to read
         | some old content that now cannot be found I am also respectful
         | of people that don't want their content on the internet
         | anymore.
         | 
         | So every time someone suggests putting some content on a
         | blockchain I wonder if they realize that there are people that
         | want to erase/remove their content from the internet. I also
         | think it is dangerous to keep everything someone or some
         | company created on the internet. It is too easy now for the
         | internet to judge some adult for things they did while young, or
         | to keep holding people accountable for mistakes they have already
         | paid their dues to society for.
         | 
         | If we ever build this feature on a blockchain I hope it is
         | opt-in and people realize what that means.
        
         | ricardobayes wrote:
         | Similarly how github is a blockchain. I think he means the ease
         | of version control by this.
        
           | fs111 wrote:
           | Github is a software development tooling provider, not a
           | blockchain
        
             | rat9988 wrote:
             | In his defense, he probably meant Git and typed too
             | quickly. It's still obvious what he meant.
        
               | ineedasername wrote:
               | Honestly, Git was not at all obvious to me from that. And
               | I fully admit that it could be a failing on _my_ part not
               | to read that into his post, but nonetheless I didn't see
               | it.
        
       | polote wrote:
       | The idea is more interesting when you think about scrapers. Take
       | any ecommerce website: there are several scrapers that download
       | all pages every hour. It would be more efficient if a provider
       | had a live copy of the website and served the requests to the
       | scrapers, or could even send webhooks.
       | 
       | A website could handle tons of scrappers without having high
       | bandwidth, only the provider will need high bandwidth.
       | 
       | The issue is that scrapers often play with cookies and dynamic
       | websites, and such a solution wouldn't work in these cases.
        
         | atdrummond wrote:
         | Is scraper one "p" or two? My inclination would be scrapper is
         | "scrap"-er rather than "scrape"-er.
        
         | pjc50 wrote:
         | Scrapers. A scrAper is the thing you use to remove ice from
         | windscreens or scrape data from websites. A scraPPer is someone
         | who likes getting into fights or collecting scrap metal.
        
         | akouri wrote:
         | scraper*
        
         | azalemeth wrote:
         | (in the nicest way and since your post is recent, please edit
         | s/scrappers/scrapers! It very much changed how I read it first
         | time around -- I thought you were referring to a type of failed
         | startup!)
        
           | polote wrote:
           | Thanks, I need to improve my english haha
        
       ___________________________________________________________________
       (page generated 2021-12-21 23:00 UTC)