[HN Gopher] Internet Archive as a default host-of-record for sta...
___________________________________________________________________
Internet Archive as a default host-of-record for startups
Author : bpierre
Score : 255 points
Date : 2021-12-21 16:51 UTC (6 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| mwattsun wrote:
| I think it's interesting to think about what we have lost because
| we couldn't keep everything from 100 years ago and what society
| 100 years from now will be grateful we preserved.
|
| Off the top of my head, we lost a lot of common wisdom in dealing
| with the flu pandemic of 1918 because personal letters and most
| newspapers were not preserved. I think 100 years from now they
| might wish we had preserved more from marginal and/or world
| communities. What folk wisdom is being lost? Perhaps we need to
| expand our definition of what is worth saving.
| bananamerica wrote:
| Hard to know what will be of interest for future historians.
| Some things in which we place great value can be considered
| irrelevant, while some of our junk can become historical gold.
| boarnoah wrote:
| The mundane of today is very insightful for tomorrow's
| historians.
|
| It's fascinating when you start looking into any historical
| time period (you don't even need to go far back) before a lot
| of the details become educated guesses, since no one chose to
| record the mundane in detail, or it failed to survive over
| time.
| mwattsun wrote:
| > some of our junk can become historical gold
|
| That's what interests me. For example, there's a cool
| repository of 12 step speaker meeting talks hosted in Iceland
| [1] and frankly, some of the talks are junk, but there's a
| lot of wisdom. What I find interesting is how it showcases
| how ordinary citizens talk to each other. The words they use,
| the accents, the little gems of folk wisdom contained, along
| with some uncommon stories.
|
| This will be valuable 100 years from now if, for example, you
| want to build a virtual world based in the mid to late 20th
| century and you want to get the accents correct. What phrases
| did people use? What were some common misconceptions? Maybe
| 100 years from now addiction will no longer be a problem. If
| my virtual world is to be accurate I need to know what it was
| like for ordinary people when it was a problem. Etc...
|
| [1] https://xa-speakers.org/
| bananamerica wrote:
| That's a great example, thanks. Funny enough, I got that
| insight from _Bill & Ted's Excellent Adventure_. In the
| end, the most important thing in the future was some 80s
| rock song.
| lelandfe wrote:
| An excellent place to start is talking to your parents and
| grandparents and recording their history and stories online.
| mwattsun wrote:
| That's a good suggestion I've been following
| AlexanderTheGr8 wrote:
| What compression does IA use to store websites? Using 2x better
| compression would allow them to store 2x more websites/content.
|
| I am doing some compression research, and would love to help IA
| in any way I can. There are some amazing SOTA compression
| algorithms available now.
|
| And if IA ignores images/video, and focuses only on text, they
| can store an insane amount of websites at a very low cost.
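| For a rough sense of the stakes, here's a quick stdlib-only
| comparison of compression ratios on repetitive HTML-like text
| (illustrative only; I don't know what IA actually uses):

```python
# Compare stdlib codecs on a sample of repetitive HTML-like text,
# the kind of content a web archive stores in bulk. Illustrative
# only: the Internet Archive's actual storage format is not shown.
import bz2
import lzma
import zlib

sample = b"<html><body><p>Hello, archive!</p></body></html>\n" * 1000

for name, compress in [
    ("zlib", lambda d: zlib.compress(d, 9)),
    ("bz2", lambda d: bz2.compress(d, 9)),
    ("lzma", lambda d: lzma.compress(d)),
]:
    ratio = len(sample) / len(compress(sample))
    print(f"{name}: {ratio:.0f}x")
```

| On highly repetitive markup the ratios are enormous; real pages
| with mixed content compress far less, which is why the 2x claim
| depends heavily on the corpus.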
| tonymet wrote:
| Is it really worth archiving ?
| maxbond wrote:
| It's difficult to know beforehand! A startup might be a total
| flop that delivers nothing of value to us now, but it might be
| a valuable datapoint for future historians to understand how
| startup culture changed and evolved. Or it may be interesting
| to future founders - I've seen some startups with perfectly
| fine ideas fail, and then a few years later, someone succeeds
| doing something very similar.
|
| This may not be the best example, but while I'm sure the
| Rosetta Stone, being a treaty, likely would have seemed like
| something worth preserving, would anyone have imagined that it
| would be the pivotal document in understanding ancient
| Egyptian? That it would be one of the most important documents
| of all time?
| riobard wrote:
| Correct me if I'm wrong, but isn't this problem the ideal use
| case for projects like IPFS? Anyone interested in preserving
| the content can join as a node to balance the load, right? And
| if so, why don't we see widespread adoption?
| jayd16 wrote:
| IPFS does the opposite, right? It doesn't guarantee the archive
| is available, which is what Carmack is asking for. Incentives
| to scale bandwidth with need already exist as long as you have
| the data at all.
|
| That is to say, IPFS doesn't help if the desire blooms after
| the nodes dry up. Things could still be lost.
| Ericson2314 wrote:
| The point of IPFS is not to keep the data archived, but to
| allow users not to care who does the archiving.
|
| Concretely, this would be to skip the "many people on
| encountering a dead URL don't bother to try the internet
| archive" problem.
| ricardobayes wrote:
| There are many projects that go open source when they fail.
| The extra step here is IA would become the A record for the
| project (at least temporarily?)
| Ericson2314 wrote:
| I was involved with planning
| https://nlnet.nl/project/SoftwareHeritage-P2P/ for just this
| reason --- hopefully we will finally be able to start work on
| it sometime not too far off.
|
| Indeed the _real_ challenge of archival is not losing the
| stuff, but making sure that people can still find it.
| "Orphaned" information that no one knows exists, or is
| bothering to interact with, isn't that valuable compared to
| resources that are actively being used and still "live" in the
| culture.
|
| Of course, the archive can never serve the same amount of
| bandwidth, but the goal is a) interested parties can mirror the
| stuff they care about in a higher bandwidth-per-item way after
| some huge disruption, and b) random viewers never notice
| something going down, nor who is serving the info, but just a
| temporary drop in connection quality.
|
| Ultimately, location-based addressing is a stupid way to run
| society, needlessly fragile by baking in property claims (IPs,
| DNS, etc.) that are incidental to the task at hand.
| Content-based addressing, with location based hints to avoid
| trying to solve really hard problems all at once, is the only
| way to make culture more robust.
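| The idea in one toy sketch (stdlib only; real systems like IPFS
| add multihashes, chunking, and a DHT): the address *is* the hash
| of the bytes, so it doesn't matter who serves a copy, because the
| fetcher can always verify it against the address.

```python
# Minimal content-addressed store: put() names data by its SHA-256
# digest, get() verifies the bytes against that name. A corrupted or
# substituted copy fails verification no matter who served it.
import hashlib

store: dict[str, bytes] = {}

def put(data: bytes) -> str:
    addr = hashlib.sha256(data).hexdigest()
    store[addr] = data
    return addr

def get(addr: str) -> bytes:
    data = store[addr]
    # Any mirror can serve this; the check below is the trust anchor.
    assert hashlib.sha256(data).hexdigest() == addr
    return data

addr = put(b"hello, robust culture")
assert get(addr) == b"hello, robust culture"
```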
| btown wrote:
| The great thing about location-based addressing is that an
| archive of the set of known locations is not subject to the
| same ownership rules as the canonical live version of those
| addresses. A document listing all Geocities URLs can be
| placed in content-addressed storage without needing
| geocities.com to be owned by the party that emplaces that
| document. And a chain can be maintained such that people are
| incentivized to remember that document into the far future.
| Coupled with archival of the actual content, you bypass the
| exclusivity of domain ownership.
|
| Of course, ensuring that there's persistence of _attention_
| as well is a tougher problem. But one only needs to look at
| sites like https://reddit.com/r/tumblr to realize that there
| is immense societal interest in "meme archaeology." Reducing
| the barriers to entry for would-be archaeologists, giving them
| a "chain" of breadcrumbs that lead to content, and building
| communities that will socially reward people for their
| archaeology work, is the best thing we can possibly do.
| Ericson2314 wrote:
| > The great thing about location-based addressing is that
| an archive of the set of known locations is not subject to
| the same ownership rules as the canonical live version of
| those addresses.
|
| Erm, to me this sounds like putting up with link rot as a hack
| around bad IP law? There are already IP exceptions for
| preservation. And if content addressing were the norm,
| geocities-type sites might bow to market pressure to not "own"
| the content, but merely have some sort of license for being the
| exclusive pinning service and running the ads or whatever. This
| is like avoiding the problem where the rent on your current
| apartment doesn't fall as much as the market writ large,
| because your landlord knows moving is not free.
| thesausageking wrote:
| IPFS only does addressability, it doesn't provide storage. You
| could use a decentralized storage network like Arweave,
| Filecoin, or Sia.
|
| https://www.arweave.org/
|
| https://www.filecoin.com/
|
| http://sia.tech/
| cle wrote:
| Or "centralized" ones like Fleek, Textile, Pinata, etc.
|
| https://fleek.co/hosting/
|
| https://docs.textile.io/buckets/
|
| https://www.pinata.cloud/
| [deleted]
| dannyobrien wrote:
| It is -- and Brewster Kahle and the Archive have been thinking
| about this for a long while (see this talk from him five years
| ago: https://archive.org/details/LockingTheWebOpen_2016 ). The
| model you can think of for this would be to have the Archive as
| the "node of last resort" of content-addressable storage,
| making sure there's always one node up with the content you
| want.
|
| The incentive challenges are making sure that the average
| number of nodes is _more_ than one, because, as Brewster likes
| to say, "libraries burn; it's what they do", plus all the
| traditional challenges of maintaining a commons at high levels
| of resilience. Once you have data on a network like IPFS, we
| can use a number of incentive models to make sure it stays
| there, including charitable projects like the Archive,
| government support (archives are traditionally state projects
| -- if every country's archive was pinning this content, it
| would be far more resilient), and decentralized incentive
| frameworks like Filecoin.
|
| (Disclosure: I work for the Filecoin Foundation; in our
| decentralized preservation work, we've funded the Internet
| Archive's work in this area, though I should emphasise that IA
| works with a lot of different decentralizing technologies
| through their https://getdweb.net/ community.)
| nickdothutton wrote:
| I wrote a little about this and the alternatives here, especially
| for sites with user-generated content which may be impossible to
| find/regenerate from elsewhere.
|
| I call it "Beating the Samson Option": pulling the temple down
| upon your own head.
|
| https://blog.eutopian.io/beating-the-samson-option/
| pjc50 wrote:
| I'm old enough to remember when the host-of-record for failed
| startups was fuckedcompany.com ...
|
| I do wonder how many startups actually _want_ to be archived,
| rather than just ditch everything with unseemly speed as soon as
| they get acquishutdown.
| monkeydust wrote:
| There are valuable lessons to be had in failure. Through some
| coordination they could perhaps get compensation for sharing,
| although this goes against it all being open and free.
| kodah wrote:
| Personally, I think eternally archiving everything and making
| public data infinitely available has been not-so-great. If this
| were an "archive with consent" sort of system, then sure. My
| response may be better summarized as, "Does IA support
| robots.txt, and if not, why?"
| rectang wrote:
| Throughout human history, records have been forgotten,
| rewritten, changed, mutated, degraded, eroded away to
| nothingness. "The internet is forever" has always struck me as
| inhumane. Make a mistake or expose a weakness on the internet
| and it will always accompany you.
|
| It turns out that the internet is not always forever. I find
| that comforting.
| treesknees wrote:
| Depends on how you think of "forever". If you post an
| embarrassing video and someone saves and reposts it with your
| name attached, odds are that video isn't going to be around
| in 200 years. But what about the next 10, 20 or 40 years? In
| the context of your overall professional adult life, that's a
| long time.
|
| "The Internet is forever" isn't some natural law by which all
| content abides as though it can never disappear. It's a
| warning that you don't control the content once it's
| accessible on the Internet.
| symlinkk wrote:
| Maybe humans should become more accommodating of past
| mistakes.
| _jal wrote:
| Suggest that to them. I'm sure they'll get right on it.
| oconnor663 wrote:
| I think there's inhumanity of a kind on both sides of this
| question. "Everything you've done will be forgotten, and no
| one will remember your name" is the sort of thing the bad
| guys say in movies. But that's what happens to most of us in
| the end. I think it's natural not to want that.
| garaetjjte wrote:
| You're consenting by posting it on the public internet in the
| first place.
| ghaff wrote:
| Posting something on the public internet is not consent for
| you to scrape it and post it on your own site forever.
|
| And requiring an explicit opt-in would basically mean no IA.
|
| To be clear, the IA is a positive, maybe even a great one.
| But it skirts by because most people don't care. (They did as
| you say post whatever on the public internet.) Add the facts
| that they're a non-profit, aren't trying to monetize their
| hosting, and will generally take things down if the owner asks.
|
| Libraries and other archives have some very limited special
| rights (which mostly relate to making physical backups of
| physical books). But invoking "library" isn't some general
| get out of jail free card with respect to copyright.
| blackearl wrote:
| People have recorded others without their consent for
| millennia. Whether it's telling a story about what someone
| said, a photo, or now a screenshot of a twitter post, that
| is reality. You'll never be able to stop someone from
| telling another that you said X.
| _jal wrote:
| > Posting something on the public internet is not consent
| for you to scrape it and post it on your own site forever.
|
| It effectively is. Your consent is not required, and people
| are doing far worse than just keeping it available
| (Clearview; there are also reports of people hoovering up
| encrypted data to crack in the coming decades when we're
| post-quantum).
|
| This is no different than demanding people not keep track
| of anything else, and attacking archive.org might make you
| feel better, but that won't make anyone else stop.
| pessimizer wrote:
| It can be. The reason libraries and other archives have
| special rights is because they fought for them against the
| express wishes of people who sold paper. There are no
| arguments made against archive.org that weren't also made
| against libraries.
| Mezzie wrote:
| Thank you.
|
| These rights are also under constant attack: It's normal
| to charge libraries exorbitant prices for digital
| materials compared to their analogue counterparts, for
| example.
| netrus wrote:
| Do you think most people who post publicly on the internet
| would agree if asked? I think most people would like to have
| a choice to make old stuff disappear. If most people think
| so, THAT should be the rule. You might not like that and
| argue that there is no way to enforce it, but that does not
| mean it is a good rule to assume consent.
| ghaff wrote:
| There are two things:
|
| 1.) Most people won't opt-in because a significant majority
| accept defaults and don't opt into most things.
|
| 2.) For people like yourself probing a bit deeper, you
| might well ask whether you really want to give up your
| ability to decide you don't want something you thought was
| so funny when you wrote it at 20 now that you're a
| politician running for office or up for a political
| appointment.
| kodah wrote:
| I mean even honoring the opt-out of robots.txt would be
| fantastic. As another commenter pointed out they
| willfully ignore it:
| https://blog.archive.org/2017/04/17/robots-txt-meant-for-
| sea...
|
| It's fairly unethical.
| BeFlatXIII wrote:
| How about this? Archives will still exist regardless of
| consent. Does that make them digital rapists?
| FrostKiwi wrote:
| Public data is... public. No one should be stopped from saving
| a public page, and nothing should stop the Internet Archive, be
| it robot or human. There is of course a need for removal of
| archived content infringing on someone's rights, whatever that
| might be, but "archive with consent" will fail at the goal of
| preserving culture. I think it's worrying that some online
| newspapers enacted archive blockers, or that IA needs DMCA
| exemptions just so companies can't DMCA anything with their
| name on it. To preserve journalistic integrity and to save
| culture, even if it collides with intellectual property rights,
| "archive with consent" won't cut it.
| orev wrote:
| Content posted on a web site is NOT "public" (domain), it is
| (in the US) automatically copyrighted to the author, unless
| they specifically waive those rights. Just because you can
| see it through a browser doesn't in any way mean you can make
| it yours and do what you want with it.
| ImprobableTruth wrote:
| We store books in public libraries even if they aren't
| public domain and authors can do nothing to prevent them
| from doing so.
| ghaff wrote:
| First sale doctrine in the US. If I buy a physical book,
| I can give it, loan it, throw it in the trash, etc. This
| doesn't apply to making electronic copies--see what
| happened with Google Books for example.
| rat9988 wrote:
| If you buy the book, you have an authorization to do so.
| But you can't make copies of it.
| Mezzie wrote:
| You do if you're an archive, actually.
| FrostKiwi wrote:
| Absolutely true. "unless they specifically waive those
| rights" - if archiving entailed contacting the owner with a
| legal archive request, we would have archived basically
| nothing. Luckily there are exceptions for Internet Archive
| in place. My point is, if "by consent" was the requirement
| to archive information, we would have archived nothing.
| Mezzie wrote:
| It doesn't matter.
|
| Archives are exempt from being forbidden to create copies
| due to copyright infringement. The Library of Congress can
| make all the copies it wants, it just can't SELL them.
|
| Now, there is a question whether a private company should
| legally be able to BE an archive of record, but as of now
| there's no legal reason they can't be, I believe. So it's
| legal.
| 1MachineElf wrote:
| There is some public discussion about why IA does not strictly
| adhere to robots.txt:
|
| https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
| kodah wrote:
| They're basically saying they're choosing to ignore a web
| convention that explicitly states that people don't want their
| websites archived or searchable, because they want them to be.
| Sounds pretty unethical to me.
| Ericson2314 wrote:
| There are clearly some things we, and the author, want to
| persist, and yet the internet fails to do so.
|
| Trying to muddle in privacy concerns with the accumulation of
| public knowledge undermines the whole concept of a shared
| society, or there being even the potential of accumulating
| "progress" in the first place.
|
| Also, historically speaking, people with means used to save
| their letters for posterity, which proved to be a very valuable
| resource for future academics, so the idea of, what, deleting
| all your Proton mails and Signal messages as encouraged is
| arguably overshooting the return to some pre-internet norm.
| kodah wrote:
| > Also, historically speaking, people with means used to save
| their letters for posterity, which proved to be a very
| valuable resource for future academics
|
| Your example is an example of _choice_ or consent. They also
| had the option to burn their hand written books and scrolls
| down periodically. Systems like IA take that choice away.
| Ericson2314 wrote:
| There is no reason to do personal communication with static-ish
| websites. Don't foist the problems of social media onto the
| Internet Archive.
| kodah wrote:
| I'm not sure I understand. robots.txt was a matter of
| consent and was standardized well over twenty years ago.
| The IA willfully ignores it because they believe it
| interferes with their mission. This was long before
| social media, when static websites were more dominant
| than dynamic ones.
|
| Source: https://blog.archive.org/2017/04/17/robots-txt-
| meant-for-sea...
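| For reference, honoring the convention takes a few lines with
| Python's stdlib parser (sketch on an inline example file, no
| network involved):

```python
# What a crawler that honors robots.txt does before fetching a URL,
# using the stdlib parser on an inline example file.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Allowed paths pass, disallowed paths are skipped:
assert rp.can_fetch("*", "https://example.com/index.html")
assert not rp.can_fetch("*", "https://example.com/private/diary.html")
```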
| luckylion wrote:
| "You shouldn't have privacy because it undermines the concept
| of a shared society, and also future historians might find
| your life interesting."
|
| Thanks.
| Ericson2314 wrote:
| Ridiculous strawman.
| thrdbndndn wrote:
| I genuinely didn't understand the point he's trying to make.
|
| Could someone ELI5?
| 1vuio0pswjnm7 wrote:
| Why is there only one IA?
|
| Why is IA not globally distributed, like a CDN?
|
| I use IA for "problem" websites, e.g., ones that rely on SNI,
| i.e., ones hosted at certain CDNs. I simply add these sites to
| a list and the local proxy does the rest.
|     http-request set-uri
|         https://web.archive.org/web/1if_/http://%[req.hdr(host)]%[pathq]
|         if { hdr(host) -m str -f list }
|
| IA "hosts" an enormous number of sites without the need for SNI
| (plaintext hostnames sent over the wire).
|
| EDIT: @sebow the way they (re)format the HTML is less friendly to
| the text-only browser I use.
| thrdbndndn wrote:
| > 1if_
|
| This is a neat shortcut to simply get the very first archived
| version! I often have to go to /*/ and manually click on one of
| them, which is very tiring.
|
| Is there one to get the latest?
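| From the URL scheme it looks like the timestamp in
| /web/<timestamp><modifier>/<url> is matched to the nearest
| capture, so a far-future timestamp should land on the newest
| snapshot the same way 1 lands on the oldest. A sketch (guessed
| from observed behavior, not a documented guarantee):

```python
# Build Wayback Machine URLs. The timestamp is matched to the
# nearest capture, so "1" resolves to the earliest snapshot and a
# far-future value like "2999" should resolve to the latest.
# (Assumption from observed behavior, not a documented guarantee.)
def wayback_url(url: str, timestamp: str = "2999", raw: bool = True) -> str:
    modifier = "if_" if raw else ""  # if_ serves the capture without the toolbar
    return f"https://web.archive.org/web/{timestamp}{modifier}/{url}"

print(wayback_url("http://example.com", timestamp="1"))  # earliest capture
print(wayback_url("http://example.com"))                 # latest (assumed)
```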
| sebow wrote:
| amenghra wrote:
| What's your issue with SNI/threat model? If you use a non-SNI
| site, anyone can tell which site you are visiting since there's
| only one domain on that IP.
| 1vuio0pswjnm7 wrote:
| There is a difference between making something "impossible"
| and making something "easier". Performing reverse DNS
| lookups, or otherwise trying to maintain a global table of
| 1:1 domain:IP mappings and perform lookups in real-time, is
| nowhere near as easy nor reliable as sniffing SNI. IME, it is
| neither easy nor reliable, nor worth the effort. SNI is the
| preferred method. SNI is easier. SNI is 100% reliable for
| detecting what hostname the user is trying to access.
|
| What is the point of so-called "DNS privacy"/"Private DNS" if
| "anyone can tell which site you are visiting" simply by
| observing IP addresses, without any need to see domain names?
|
| If SNI (plaintext hostnames sent over the wire) is a non-issue,
| then why are people working on Encrypted Client Hello in
| TLS 1.3?
| yellowapple wrote:
| Seconding that curiosity. Pretty much every web server I've
| built uses SNI (at least if it's hosting sites under multiple
| domains), and the only "downside" of which I'm aware is the
| lack of IE6 support.
| zaps wrote:
| I was with him up until "blockchain".
| markjgraham wrote:
| Hi,
|
| I manage the Wayback Machine at the Internet Archive.
|
| Very happy so many people here care about preserving, and making
| available, our cultural heritage!
|
| Please know a dedicated, and talented, team of engineers works
| every day to do a better job of archiving more of the public Web,
| and making it available via the Wayback Machine.
|
| As noted the Internet Archive is experimenting with filecoin.io
| and storj.io and is always open to suggestions about how we might
| do our jobs better, and improve our service. We also host regular
| meetups (and have hosted summits and a camp) related to the
| Decentralized Web. See: https://blog.archive.org/tag/dweb/
|
| The Internet Archive also offers archive-it.org, a subscription
| service, for those who want a higher level of support and more
| features.
|
| We appreciate any support you can offer, financial and otherwise.
| Please share any bug reports, feature suggestions and other
| feedback with us via email to info@archive.org
|
| Oh, and... check out the new PDF Search feature we just
| launched at the bottom of web.archive.org. More to come like
| that in 2022.
|
| Finally, you might also find some of the things I wrote here of
| interest: https://gijn.org/2021/05/05/tips-for-using-the-
| internet-arch...
| a-dub wrote:
| it sounds like he wants to be able to archive and respin-up/run
| things like mmorpg game servers as easily as static content can
| be archived and served today.
|
| that would be a huge and expensive paradigm shift for internet
| service backends that traditionally have been only designed to be
| run by one entity and typically are a mix of custom, open source
| and proprietary software that is run in a specific way.
|
| i suppose it could be done, but there hasn't been any reason to
| make that investment. service backends also tend to be more
| "living software" where part of the system is the team that
| continuously builds, updates and operates it.
|
| basically, it would look something like java applets, but for
| entire internet service backends. one step and all the services,
| databases, everything would spin up and start serving. that would
| be great but is probably a ways out.
| jayd16 wrote:
| Well there's other static things like FAQs, news feeds, even
| hosted game content that a game might rely on. It's not just
| dynamic services. I think Carmack is referring to that.
| a-dub wrote:
| ..."something like this could combine with a blockchain style
| technology to make internet applications that could outlive
| companies. A niche multiuser game that couldn't meet company
| revenue goals could still be "fed" by anyone that wanted to
| push resources at it, since the "
|
| that's not static assets, that's a multiuser game server.
|
| i think he's envisioning entire internet service backends
| that can be packaged up like java applets and re-run on
| demand, paired with some kind of decentralized serving
| infrastructure where any user can insert coins and resurrect a
| sophisticated web service from the past.
|
| more likely i suspect we'll see more efforts by hobbyists to
| resurrect these things and more releases of backends from
| failed projects into the public domain.
|
| with so much physical gear that requires service backends
| being made today, we may even see regulation that requires
| release of the source for a service when a service is shut
| down. crazy to think that if the company who made your car or
| tractor fails, your perfectly good car or tractor could cease
| to function when they shut down the service backend.
| msla wrote:
| I'd love it if the Wayback Machine were less touchy and more
| reliable.
|
| The failure mode I see very often is that the frontend apparently
| doesn't know what the backend's doing: The part which ingests
| URLs and tells you what URLs have been archived does not know
| what archives the backend has, so it will tell you a page has
| been archived and give you a link to the archive, but when you
| click the link, it tells you it does _not_ have the page
| archived, oh, look, it exists online, would you like to archive
| it now? Archive it again, and it will tell you that you can only
| archive a page once every 45 minutes. If you're a weird little
| obsessive like myself, you go through this process a half-dozen
| times for one page before it acknowledges that, yes, it does
| have the page archived (_once_, mind you) and you can actually
| see it.
|
| While I'm filing bitch reports...
|
| The Wayback Machine apparently loves setting cookies. It will set
| cookies until it has exceeded its own ability to _accept_
| cookies, at which point it will give you a blank page and you
| have to look in the developer console to figure out that it sent
| you a "too many cookies" error in the response header. I've had
| to force my browsers to not accept any cookies from the Internet
| Archive to fix this.
| alberth wrote:
| >>" I wonder if there could be a world where the IA acts as a
| default host-of-record for startups, with a super-easy CDN
| relationship such that the content"
|
| Doesn't IA already partner with Cloudflare to do exactly what
| Carmack is suggesting?
|
| https://blog.cloudflare.com/cloudflares-always-online-and-th...
| magila wrote:
| You still need to provide a separate origin server (i.e. host-
| of-record) when using Always Online. AO is designed to use IA
| as a backstop when your origin server goes down.
| kderbyma wrote:
| I love this idea. I am working on a mini project for a
| decentralized game which can be self hosted or snapshots - so you
| can share / modify a portion of it and provide a custom
| experience
| dleslie wrote:
| There's nothing about his suggestion that requires use of a
| Blockchain. IA could certify authenticity easily without it.
| simonw wrote:
| The feature I most want from the Internet Archive is the ability
| to donate them an old domain name and enough cash to renew it for
| the next hundred years such that they can keep an archived
| version of a site available (without breaking any incoming links)
| for a very long time.
|
| They would also need to be able to handle legal administration
| costs of things like DMCA take-down notices, but I assume they
| already have to deal with that for the rest of the archive so
| hopefully that's not an extra complexity for them.
| EGreg wrote:
| Why not use MaidSAFE or IPFS for that?
| hluska wrote:
| That doesn't solve the 'everyone you know is dead' problem.
| frakkingcylons wrote:
| With regards to IPFS, you'd need to find a pinning service
| that offers long-term contracts. Moreover, I have more
| confidence in the Internet Archive being around in 10, 20, or
| 30 years than in any existing pinning service. That's not
| meant to be a slight against said pinning services, just that
| the IA is well established in their role.
| EGreg wrote:
| Check out Freenet
|
| They drop the stuff that gets accessed least, so if you want
| to keep something online you have to pay someone to keep
| accessing it. Makes sense, but I'd argue that it should be a
| market, meaning the price should go up as more people try to
| access stuff, but anyone can start to seed popular content
| and collect revenues for hosting it too (this is better than
| wasting electricity on accessing stuff or doing proof of
| work).
|
| But isn't FileCoin exactly that for IPFS?
|
| MaidSAFE goes a step further and has nodes rebalance
| autonomously and earn the most safecoin; as something gets
| more popular it gets seeded more.
| Ericson2314 wrote:
| A far cheaper solution is a browser extension that looks up DNS
| differently based on the age of the link.
|
| It wouldn't be hard to maintain a hand-crafted database of when
| domains are reused for something completely different, or even
| when the same conceptual website has breakages, and use that to
| choose between the Internet Archive or the live web
| accordingly. When one is browsing from an Internet Archive
| page, the date is known; when one is browsing from a live
| website, heuristics can be used, along with "bisecting" dates
| when the link is dead.
|
| Ultimately we want more content addressing to avoid this
| problem entirely (see below), or DNS -> PubKey, PubKey ->
| latest content, with some law that the pubkeys shall not be
| reused for unrelated things, vs. DNS, which is mere ephemeral
| Huffman encoding. So see below for the stuff on IPFS. But the
| trick above is a good stop-gap, and indeed the database itself
| used to back the extension could be on IPFS.
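| The core decision such an extension would make might look like
| this (a sketch; the database contents, the cutoff date, and all
| names are made up for illustration):

```python
# Rewrite old links to the Wayback Machine, assuming a hand-crafted
# database of when each domain was repurposed for something
# unrelated, as the comment proposes. Links dated before the cutoff
# go to the archive; newer ones go to the live web.
from datetime import date

# domain -> date it was repurposed (hypothetical example data)
REPURPOSED = {"example.com": date(2015, 6, 1)}

def resolve(url: str, link_date: date) -> str:
    domain = url.split("/")[2]
    cutoff = REPURPOSED.get(domain)
    if cutoff and link_date < cutoff:
        ts = link_date.strftime("%Y%m%d")
        return f"https://web.archive.org/web/{ts}/{url}"
    return url  # live web

assert resolve("http://example.com/page", date(2010, 1, 1)).startswith(
    "https://web.archive.org/web/20100101/")
assert resolve("http://example.com/page", date(2020, 1, 1)) == "http://example.com/page"
```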
| zamadatix wrote:
| Trying to figure out the age of links referenced against a
| hand crafted database (or trying to figure out if the current
| version is "too different" based on age automatically) using
| a browser extension only serves to create an unreliable
| solution for a few using the extension. Cheaper sure but it's
| also doing a whole lot less.
|
| Alternative content-addressed systems may also work better for
| finding the content than someone hosting your DNS records for a
| long time, but the bulk of the problem space is in guaranteeing
| active hosting so that content of the age will remain viewable
| for many years, not in addressing the content.
| Archive offers the ability to view old content on modern
| browsers even if modern browsers have 0 support for such
| content anymore (or if browsers ever had support at all
| even). Forward compatibility isn't something solved by a
| protocol.
| Ericson2314 wrote:
| > using a browser extension only serves to create an
| unreliable solution for a few using the extension. Cheaper,
| sure, but it's also doing a whole lot less.
|
| This is rather pessimistic thinking. The same money that
| goes into buying up domains could go into lobbying browsers
| to add this functionality by default.
|
| > the bulk of the problem space is guaranteeing that active
| hosting, viewable to readers of that era, will be available
| for many years, not addressing the content.
|
| You're moving the goal posts. I am not saying "IPFS means
| we don't need the internet archive". We absolutely do need
| the internet archive. Content addressing helps by making
| archiving transparent, so the archival copy is not worse
| than the original.
|
| Fundamentally, consumers, producers, or archivists may be the
| party most interested in the continued existence of some
| information at different moments in the lifespan of that
| information. Location-based addressing forces the producers
| to shoulder the burden of hosting, but content-based
| addressing allows the work to be distributed among those 3
| however we see fit. Of course the burden must still be
| borne! That doesn't mean the flexibility isn't extremely
| useful.
|
| > On top of hosting and addressing the Internet Archive
| offers the ability to view old content on modern browsers
| even if modern browsers have 0 support for such content
| anymore (or if browsers ever had support at all even).
| Forward compatibility isn't something solved by a protocol.
|
| Yeah that's great too, and again not something I am arguing
| is not good, or not necessary.
| ignitionmonkey wrote:
| I made something very similar, though instead of age, it
| enforces page authorship for links (using PGP signatures).
| https://webverify.jahed.dev/
| btrettel wrote:
| A full-text search for the Wayback Machine would be my top
| feature request. It's not uncommon to lose the URL of a site
| and for active webpages to not have the URL of the old website.
| Plus I'm sure there are many interesting archived webpages I
| could find with a full-text search.
|
| I understand they've tried this or things like it a few times
| but they haven't ever kept the feature.
| gregsadetsky wrote:
| I imagine that the costs to run this would outweigh the
| potential marketing benefits, but it'd be amazing to see
| Algolia take this project on to benefit everyone.
|
| The Wayback Machine's data is ~20PB..? What is approximately
| the size of the indexable text (i.e. the text content of html
| pages, sans tags)? And what would the index size be like,
| approximately?
|
| I imagine that creating (and maintaining, of course) the
| index would be the most time-consuming part? Is it at all
| possible to imagine hosting this index... somewhere... and
| doing sqlite http range-like queries on it..?
|
| Would it be enough to have an index consist of a list of
| found words, and the related "document ids"? i.e. "apple" is
| in doc ids 1000, 2000, 3000, "banana" is in doc ids 2000,
| 4000, etc.?
|
| And have separate docid -> archive.org url mapping?
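The inverted-index layout the questions above describe can be sketched in a few lines; the document ids, page texts, and URL scheme here are all made up for illustration:

```python
from collections import defaultdict

# Toy corpus: docid -> page text (all entries invented for illustration).
docs = {
    1000: "apple pie recipe",
    2000: "apple and banana smoothie",
    3000: "apple orchard tour",
    4000: "banana bread",
}

# Build the inverted index: word -> set of docids containing it.
index = defaultdict(set)
for docid, text in docs.items():
    for word in text.split():
        index[word].add(docid)

# Separate docid -> URL mapping, resolved only after a hit
# (hypothetical URL scheme, not real archive.org paths).
urls = {docid: f"https://archive.example/doc/{docid}" for docid in docs}

print(sorted(index["apple"]))   # -> [1000, 2000, 3000]
print(sorted(index["banana"]))  # -> [2000, 4000]
```

At web-archive scale the index would be sharded and compressed, but the shape of the data is the same: postings lists per term, plus a docid-to-URL table.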
| adventured wrote:
| > The feature I most want from the Internet Archive
|
| The feature I most want from IA is a streamlined system to
| delete content they have archived on domains that I own,
| including a proper privacy law compliance effort on their part.
| They have intentionally made it a difficult, manual process to
| get content removed. They operate as a de facto malicious
| crawler.
|
| They massively violate GDPR with how they operate and few seem
| to care about that fact, including all the commenters on HN
| (which universally give them a free pass on being malicious and
| violating GDPR very aggressively).
|
| When IA has to comply with laws like GDPR, that's the end of
| IA.
| pjc50 wrote:
| > When IA has to comply with laws like GDPR, that's the end
| of IA.
|
| Will you be happy when you've burned down that library?
| Mezzie wrote:
| I can't speak to GDPR specifically because I'm not European,
| but a fair number of laws have leeway for preservation
| purposes. (For example, section 108 of the Copyright Act in
| the US functionally exempts archives from being punished for
| copying provided they are doing so for preservation
| purposes).
|
| There are very good reasons that archives will not destroy or
| alter information outside of very clear difficult and manual
| processes.
|
| And actually, looking at it, I don't think they're
| necessarily in violation of GDPR [0].
|
| Point 3 says: "Where personal data are processed for
| archiving purposes in the public interest, Union or Member
| State law may provide for derogations from the rights
| referred to in Articles 15, 16, 18, 19, 20 and 21 subject to
| the conditions and safeguards referred to in paragraph 1 of
| this Article in so far as such rights are likely to render
| impossible or seriously impair the achievement of the
| specific purposes, and such derogations are necessary for the
| fulfilment of those purposes."
|
| According to GDPR, national law of EU parties overrules GDPR
| when it comes to personal data being used in archival
| context. I don't know every EU country's stance, but most of
| the bigger economies would allow for this.
|
| There is also a difference between deleting the data and
| rendering it inaccessible to the public. Keeping something
| under wraps is generally more 'acceptable', but active
| destruction of the item (digital or not) and its provenance
| is much more limited. Also there's a difference between
| personally identifying data (covered by GDPR), your content
| (which would be covered under copyright and not GDPR), and
| connections people can make if that content is available (not
| covered at all because it's not anybody else's issue if you
| write something terrible and people keep recognizing you over
| it so long as you did actually write it).
|
| [0] https://gdpr-info.eu/art-89-gdpr/
| msla wrote:
| https://web.archive.org/web/20200813235643/http://slawsonand.
| ..
|
| > Article 3(2), a new feature of the GDPR, creates
| extraterritorial jurisdiction over companies that have
| nothing but an internet presence in the EU and offer goods or
| services to EU residents[1]. While the GDPR requires these
| companies[2] to follow its data processing rules, it leaves
| the question of enforcement unanswered. Regulations that
| cannot be enforced do little to protect the personal data of
| EU citizens.
|
| > This article discusses how U.S. law affects the enforcement
| of Article 3(2). In reality, enforcing the GDPR on U.S.
| companies may be almost impossible. First, the U.S. prohibits
| enforcing of foreign-country fines. Thus, the EU enforcement
| power of fines for noncompliance is negligible. Second,
| enforcing the GDPR through the designated representative can
| be easily circumvented. Finally, a private lawsuit brought by
| in the EU may be impossible to enforce under U.S. law.
|
| [snip]
|
| > Currently, there is a hole in the GDPR wall that protects
| European Union personal data. Even with extraterritorial
| jurisdiction over U.S. companies with only an internet
| presence in the EU, the GDPR gives little in the way of tools
| to enforce it. Fines from supervisory authorities would be
| stopped by the prohibition on enforcing foreign fines. The
| company can evade enforcement through a representative simply
| by not designating one. Finally, private actions may be
| stalled on issues of personal jurisdiction. If a U.S. company
| completely disregards the GDPR while targeting customers in
| the EU, it can use the personal data of EU citizens without
| much fear of the consequences. While the extraterritorial
| jurisdiction created by Article 3(2) may have seemed like a
| good way to solve the problem of foreign companies who do not
| have a physical presence in the EU, it turns out to be
| practically useless.
| gojomo wrote:
| Shouldn't it be hard to delete things from a library and
| historical-archive?
|
| If you had to choose between the GDPR, & an accurate
| historical record, which would you prefer?
| Mezzie wrote:
| We won't. Pretty much the only way things leave archives of
| record (which is what IA is trying to be) is through Acts
| of God (if the archive burns down / all of IA's servers are
| taken out in a mass alien EMP attack).
|
| An author suggesting that the LoC remove their copy of a
| book/other work (including digital works) because they want
| to unpublish it would not fly.
|
| The parent comment has an issue with anything on the Web
| not hidden in some way being considered 'public' and
| 'published', but that would be something that would require
| international cooperation to hash out.
| toomuchtodo wrote:
| It costs the Internet Archive $2/GB to host content in
| perpetuity. They have a tool, Archive-It, that will
| periodically crawl your site for archival purposes if you are
| not technical.
|
| For my needs, I run a report monthly for the content I've
| archived using my IA account to determine archived GBs, and
| then donate the amount needed to cover those costs.
|
| Consider reaching out to their patron services email address
| with any questions.
|
| Edit: $2/GB citation: https://help.archive.org/hc/en-
| us/articles/360014755952-Arch...
| tablespoon wrote:
| > It costs the Internet Archive $2/GB to host content in
| perpetuity.
|
| Do you have a source/more info on that?
|
| Let's say the internet archive is 100 PB [1], that's
| 100,000,000 GB [2], and at that rate it comes out to $200
| million [3] for the whole thing forever. That's a lot of
| money, but also a lot less than I was expecting for something
| like that.
|
| [1] https://www.protocol.com/internet-archive-preserving-
| future: "The web archive alone is about 45 petabytes -- 4,500
| terabytes -- and the Internet Archive itself is about double
| that size (the group has other collections, like a huge
| database of educational films, music and even long-gone
| software programs)."
|
| [2] https://www.google.com/search?q=100+petabytes+to+gb: "100
| petabyte = 1e+8 gigabytes"
|
| [3] https://www.google.com/search?q=1e%2B8+*+%242: "1e+8 *
| (US$ 2) = 200 million US$"
| majou wrote:
| Re: [1] 45 PB = 45,000 TB
| kaashif wrote:
| I love that you have a citation for 2 * 100 million = 200
| million. I've seen papers where the use of "=" sweeps a lot
| of non-trivial equalities under the rug, but this is the
| first time I've seen something go the other way this far.
|
| I suppose the claim is rather shocking and warrants
| citations - $200m to host the entire internet archive
| forever? I don't blame you for the excessive citation.
| rmbyrro wrote:
| $2/GB in perpetuity is really cheap. They should write a
| paper about how they did it, if they haven't already.
|
| Edit: I'm assuming they can deliver reliability and
| durability similar to modern cloud standards, like AWS S3.
| toomuchtodo wrote:
| Self-built storage nodes and software for the storage system,
| with an expectation that storage costs continue to decline
| per GB into the future.
| ineedasername wrote:
| Is that $/GB per year?
| toomuchtodo wrote:
| Last I read, that was one time [1]. Of course if you have
| the means, consider a bit more per GB.
|
| [1] https://help.archive.org/hc/en-
| us/articles/360014755952-Arch...
| Buttons840 wrote:
| I'd pay twice that to host my static blog with IA. ;)
|
| I guess archiving a static blog is already trivial for their
| system, but I'd pay $100 a year to have IA host my static
| blog. The overhead I would consider as a donation to a worthy
| cause.
|
| Then again, I can just give them $100 a year and find some
| free static hosting, like GitHub Pages, and call it a day.
| lepouet wrote:
| I agree, I'm a passionate photographer and I could pay good
| money to know that my pictures could be seen a long time after
| my death. Maybe startups exist that do this, but they will die;
| I need something with enough critical mass that I can trust.
| osigurdson wrote:
| Perhaps have a look at Arweave (https://www.arweave.org).
| boarnoah wrote:
| That doesn't address the concern re: something needing
| critical mass to increase its chance of survival over a
| longer term.
|
| Really, the way I see it, outside a few large banking firms
| it's kind of hard to be sure any provider of digital
| services would be around in the 50+ year term for this kind
| of public archive.
|
| I hope the Internet Archive manages it.
|
| EDIT: I do worry the IA has a bit of a lightning rod effect
| with skirting issues re: legality of archiving content. IMO
| it's no guarantee it survives any significant time span
| either.
| soheil wrote:
| An interesting use case for IA is to go back to the early days
| of when a startup first launched to see their pitch before they
| raised real money and added tons of nonsensical bs to their
| homepage.
|
| eg.
|
| http://web.archive.org/web/20180901110658/https://www.snowfl...
|
| http://web.archive.org/web/20140701061721/https://databricks...
| ordinaryradical wrote:
| Imagine someone building this for SaaS hosting--a perma-Heroku,
| or something like it. That's actually a huge value-add. Suddenly
| tinyStartupA doesn't need to convince largeCorpB that it's going
| to be around forever. The service can exist in perpetuity
| without the company.
|
| Complex repercussions obviously around acquisition, IP, and other
| business dimensions however. Maybe unworkable even. But I think
| there's a world where this actually exists and lowers the barrier
| to building business-critical software and selling to companies
| that need a 50-year commitment to risk you.
| jayd16 wrote:
| I wonder if it's worth it for these platforms (like Heroku) to
| simply add a donation portal. It's not as future proof as
| something fully open, but it wouldn't need the design,
| implementation and maintenance of a brand new fully open PAAS
| ecosystem.
| ineedasername wrote:
| I don't understand why Carmack thinks blockchain should be a
| component of this. Anyone care to elaborate on how that would
| make this easier/better?
| k4ch0w wrote:
| I think he's referring to something like IPFS.
|
| https://en.wikipedia.org/wiki/InterPlanetary_File_System
|
| http://ipfs.io
|
| You can put the storage costs on the nodes because storage at
| archive.org's scale adds up, especially when it's run by
| volunteers.
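The content-addressing idea IPFS is built on can be illustrated without any blockchain at all: an object's address is derived from its bytes, so any node holding those bytes can serve them, and retrieval is self-verifying. This is a bare sketch using a plain SHA-256 hash, not the real IPFS CID format:

```python
import hashlib

# Toy content-addressed store: address = SHA-256 hex digest of the bytes.
store = {}

def put(data: bytes) -> str:
    """Store data under its own hash and return the address."""
    addr = hashlib.sha256(data).hexdigest()
    store[addr] = data
    return addr

def get(addr: str) -> bytes:
    """Fetch data and verify it actually matches its address."""
    data = store[addr]
    # Self-verification: recomputing the hash detects any tampering,
    # no matter which node served the bytes.
    assert hashlib.sha256(data).hexdigest() == addr
    return data

addr = put(b"hello, archive")
print(addr[:12])        # first characters of the content address
print(get(addr))        # -> b'hello, archive'
```

Because the same bytes always hash to the same address, a mirror, an archive, or the original publisher can all serve a request interchangeably.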
| ineedasername wrote:
| It looks like you could use IPFS to accomplish this without
| using a blockchain.
| [deleted]
| zingplex wrote:
| Isn't IPFS pretty closely tied with FileCoin?
| ketzo wrote:
| "Immutable and existing in perpetuity" are good qualities for
| an archive service, and that's at least the _idea_ with a
| blockchain.
| thomasikzelf wrote:
| It is interesting though, what happens if you put so much
| data into a blockchain. I guess only a couple of nodes would
| want to verify the validity of the chain (because you need
| all the data to do it). And would those nodes really be more
| likely to keep the data than in the situation we are in now?
|
| I guess after some time the nodes would agree on the hash and
| throw away the data because it would cost too much to store.
| Jonovono wrote:
| .... He says it right after "to make internet applications that
| could outlive companies". If it's on the blockchain, it doesn't
| matter if the company storing all of the archives shuts down;
| the content would still exist, forever, as long as there is a
| network running the chain. I suppose something like BitTorrent
| could be used?
| Mezzie wrote:
| He has a point, but it's that private companies shouldn't be
| archives of record.
|
| I actually think using blockchain for things like ensuring
| provenance is interesting, since in archives being able to
| have a clean record of what happened to a piece is VERY
| useful. It just won't earn a ton of money, so we'll need to
| wait for the capitalism to burn off to see more not-for-
| profit uses.
| hbgl wrote:
| I stopped reading when blockchain was mentioned.
| Jonovono wrote:
| Why?
| asdfasjdhfauehf wrote:
| because it shows a lack of understanding of the basics of
| distributed computing, especially on top of the web we have
| today (which is how the thread started, "IA as a default
| host-of-record", which implies said records must be
| reachable by any tech-illiterate lawyer today).
|
| car analogy time: it is the same as reading a post about
| "how to lift my car to do work in the garage", and the
| second paragraph starts with "using energy harvested from
| my perpetual motion machine"
| noizejoy wrote:
| I sincerely hope that we're only witnessing
|
| "Any sufficiently long Internet discussion will propose
| blockchain as a solution."
|
| rather than
|
| "Blockchain is eating the world."
| ineedasername wrote:
| I thought it was otherwise a reasonable idea, but yes-- it
| put me off a bit when he mentioned blockchain _without
| further elaboration_.
|
| I see blockchain as a technology that may develop useful
| applications, but-- in terms of current day usage-- I'm
| extremely skeptical when it's referenced in conjunction with
| applications that might achieve the same goals without it.
| gls2ro wrote:
| I don't know why he suggests this, but while I would like to
| read some old content that now cannot be found, I am also
| respectful of people who don't want their content on the
| internet anymore.
|
| So every time someone suggests putting some content on a
| blockchain, I wonder if they realize that there are people who
| want to erase/remove their content from the internet. I also
| think it is dangerous to keep everything someone or some
| company created on the internet. It is too easy now to
| internet-judge some adult over things they did while young,
| or to hold people accountable for mistakes for which they have
| already paid their debt to society.
|
| If we ever build this feature on a blockchain, I hope it is
| opt-in and people realize what that means.
| ricardobayes wrote:
| Similar to how GitHub is a blockchain. I think he means the
| ease of version control by this.
| fs111 wrote:
| Github is a software development tooling provider, not a
| blockchain
| rat9988 wrote:
| In his defense, he probably meant Git and typed too
| quickly. It's still obvious what he meant.
| ineedasername wrote:
| Honestly, Git was not at all obvious to me from that. And
| I fully admit that it could be a failing on _my_ part not
| to read that into his post, but nonetheless I didn't see
| it.
| polote wrote:
| The idea is more interesting when you think about scrapers. Take
| any ecommerce website: there are several scrapers that download
| all pages every hour. It would be more efficient if a provider
| had a live copy of the website and then served the requests to
| the scrapers, or could even send webhooks.
|
| A website could handle tons of scrappers without having high
| bandwidth; only the provider would need high bandwidth.
|
| The issue is that scrapers often play with cookies and dynamic
| websites, and such a solution wouldn't work in these cases
| atdrummond wrote:
| Is scraper one "p" or two? My inclination would be scrapper is
| "scrap"-er rather than "scrape"-er.
| pjc50 wrote:
| Scrapers. A scrAper is the thing you use to remove ice from
| windscreens or scrape data from websites. A scraPPer is someone
| who likes getting into fights or collecting scrap metal.
| akouri wrote:
| scraper*
| azalemeth wrote:
| (in the nicest way and since your post is recent, please edit
| s/scrappers/scrapers! It very much changed how I read it first
| time around -- I thought you were referring to a type of failed
| startup!)
| polote wrote:
| Thanks, I need to improve my english haha
___________________________________________________________________
(page generated 2021-12-21 23:00 UTC)