[HN Gopher] Blocking Internet Archive Won't Stop AI, but Will Er...
       ___________________________________________________________________
        
       Blocking Internet Archive Won't Stop AI, but Will Erase Web's
       Historical Record
        
       Author : pabs3
       Score  : 471 points
       Date   : 2026-03-21 07:30 UTC (15 hours ago)
        
 (HTM) web link (www.eff.org)
 (TXT) w3m dump (www.eff.org)
        
       | xnx wrote:
       | Does Internet Archive have a distributed residential IP crawler
       | program? I would enthusiastically contribute to that.
       | 
       | There must be some mechanism to prevent tampering in such a
       | setup.
        
         | progval wrote:
         | The Internet Archive does not, but Archive Team does:
         | https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
        
           | xnx wrote:
           | Yes! I'm running an instance right now.
        
         | Retr0id wrote:
         | > There must be some mechanism to prevent tampering in such a
         | setup.
         | 
         | Trivial as long as they terminate the TLS on their end, not
         | yours. So you'd just be a residential proxy.
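The "residential proxy" idea above can be sketched roughly as follows. This is a minimal illustration, assuming the archive's crawler sets up the TLS session end to end with the origin server and the volunteer machine only forwards bytes; the `pump` helper and the socket names are hypothetical and do not describe how ArchiveTeam Warrior or the Internet Archive actually work.

```python
import socket


def pump(src: socket.socket, dst: socket.socket, bufsize: int = 4096) -> int:
    """Copy bytes from src to dst until src reports EOF.

    The relay never parses or decrypts the stream, so TLS records
    pass through as opaque data. Returns the number of bytes copied.
    """
    total = 0
    while True:
        chunk = src.recv(bufsize)
        if not chunk:  # peer closed its sending side
            break
        dst.sendall(chunk)
        total += len(chunk)
    try:
        dst.shutdown(socket.SHUT_WR)  # propagate EOF downstream
    except OSError:
        pass  # destination already closed
    return total


if __name__ == "__main__":
    # Hypothetical stand-ins for the crawler->relay and relay->origin links.
    crawler, relay_in = socket.socketpair()
    relay_out, origin = socket.socketpair()
    crawler.sendall(b"\x16\x03\x01 opaque TLS record bytes")
    crawler.shutdown(socket.SHUT_WR)
    copied = pump(relay_in, relay_out)
    print(copied, origin.recv(4096))
```

A real relay would run one `pump` per direction (for example in two threads). Because the relay holds no session keys, any change it made to the bytes should cause the TLS integrity check to fail at the endpoints.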
        
         | gzread wrote:
         | No, IA does everything above board and even honors invalid DMCA
         | takedowns.
        
       | SlinkyOnStairs wrote:
       | Devil's advocate: Anyone seeking to limit AI scraping doesn't
       | have much of a choice in also blocking archivists.
       | 
       | And it's genuinely not that weird for news organisations to want
       | to stop AI scraping. This is just a repeat of their fight with
       | social media embedding.
       | 
       | Sure. The _back catalogue_ should be as close to public domain as
       | possible, libraries keeping those records is incredibly important
       | for research.
       | 
       | But with current news, that becomes complicated as taking the
       | articles and not paying the subscription (or viewing their ads)
       | directly takes away the revenue streams that newsrooms rely on to
       | produce the news. Hence the "Newspaper trying to ban linking"
       | mess, which was never about the links themselves but about social
       | media sites embedding the headline and a snippet, which in turn
       | made all the users stop clicking through and "paying" for the
       | article.
       | 
       | Social media relies on those newsrooms (same with really, most
       | other kinds of websites) to provide a lot of their content. And
       | AI relies on them for all of the training data (remember:
       | "Synthetic data" does not appear ex nihilo) & to provide the news
       | that the AI users request. We can't just let the newsrooms die.
       | The newsroom hasn't been replaced itself, its revenue has been
       | destroyed.
       | 
       | ---
       | 
       | And so, the question of archives pops up. Because yes, you can
       | with some difficulty block out the AI bots, even the social media
       | bots. A paywall suffices.
       | 
       | But this kills archiving. Yet if you whitelist the archives in
       | some way, the AI scrapers will just pull their data out of the
       | archive instead and the newsrooms still die. (Which also makes
       | the archiving moot)
       | 
       | A compromise solution might be for archives to accept/publish
       | things on a delay, keep the AI companies from taking the current
       | news without paying up, but still granting everyone access to
       | stuff from decades ago.
       | 
       | There's just major disagreement about what a reasonable delay is.
       | Most major news orgs and other such IP-holders are pretty upset
       | about AI firms' "steal first, ask permission later" approach.
       | Several AI firms setting the standard that training data is to be
       | paid for doesn't help here either. In paying for training data
       | they've created a significant market for archives, and
       | significant incentive to not make them publicly freely
       | accessible.
       | 
       | Why would The Times _ever_ hand over their catalogue to the
       | Internet Archive if Amazon will pay them a significant sum of
       | money for it? The greater good of all humanity? Good luck getting
       | that from a dying industry.
       | 
       | ---
       | 
       | Tangent: Another annoying wrinkle in the financial incentives
       | here is that not all archiving organisations are engaging in fair
       | play, which yet further pushes people to obstruct their work.
       | 
       | To cite a HN-relevant example: Source code archivist "Software
       | Heritage" has long engaged in holding a copy of all the
       | source code they can get their hands on, regardless of its
       | license. If it's ever been on GitHub, odds are they're
       | distributing it. Even when licenses explicitly forbid that. (This
       | is, of course, perfectly legal in the case of actual research and
       | other fair use. But:)
       | 
       | They were notably involved in HuggingFace's "The Stack" project
       | by sharing their archives ... and received money from
       | HuggingFace. While the latter is nominally a donation, this is in
       | effect a sale.
       | 
       | ---
       | 
       | I find it quite displeasing that the EFF fails to identify the
       | incentives at play here. Simply trying to nag everyone into
       | "doing the thing for the greater good!" is loathsome and doesn't
       | work. Unless we change this incentive structure, the outcome
       | won't change.
        
         | Obscurity4340 wrote:
         | It would be better if there was some arrangement the papers
         | could reach with Archive where they just delay the release or
         | wait a week, then it's part of the archive. That way, news stuff
         | gets paid for when it's hot and fresh but then it gets archived
         | and the record is preserved
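The delayed-release arrangement proposed here and earlier in the thread could be expressed as a simple embargo check. This is only a sketch: the seven-day window comes from the "wait a week" suggestion, and the names `EMBARGO`, `is_public`, and `release_time` are hypothetical, not part of any archive's actual policy or API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical embargo length; the thread only suggests "a week".
EMBARGO = timedelta(days=7)


def release_time(captured_at: datetime, embargo: timedelta = EMBARGO) -> datetime:
    """Return the moment a capture would become publicly viewable."""
    return captured_at + embargo


def is_public(captured_at: datetime, now: datetime,
              embargo: timedelta = EMBARGO) -> bool:
    """Return True once the capture is at least as old as the embargo."""
    return now >= release_time(captured_at, embargo)


if __name__ == "__main__":
    captured = datetime(2026, 3, 1, tzinfo=timezone.utc)
    print(is_public(captured, captured + timedelta(days=3)))
    print(release_time(captured).isoformat())
```

The capture itself still happens on the day of publication, so the record is preserved; only public access is deferred. How long the window should be is exactly the point of disagreement raised above.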
        
       | user_7832 wrote:
       | > But in recent months The New York Times began blocking the
       | Archive from crawling its website, using technical measures that
       | go beyond the web's traditional robots.txt rules. That risks
       | cutting off a record that historians and journalists have relied
       | on for decades. Other newspapers, including The Guardian, seem to
       | be following suit.
       | 
       | I'm a bit surprised I never read about this till now, though
       | while disappointing it is unfortunately not surprising.
       | 
       | > The Times says the move is driven by concerns about AI
       | companies scraping news content. Publishers seek control over how
       | their work is used, and several--including the Times--are now
       | suing AI companies over whether training models on copyrighted
       | material violates the law. There's a strong case that such
       | training is fair use.
       | 
       | I suspect part of it might be these corps not wanting people to
       | skip a paywall (whether or not someone would pay even if they had
       | no access is a different story). But this argument makes no sense
       | for the Guardian.
        
         | user_7832 wrote:
         | I went to Guardian's website to cross check their motto
         | (getting confused with WaPo's motto) and got served this
         | (hilarious? sad?) banner. As if blocking cross website tracking
         | is somehow bad.
         | 
         | > Rejection hurts ... You've chosen to reject third-party
         | cookies while browsing our site. Not being able to use third
         | party cookies means we make less from selling adverts to fund
         | our journalism.
         | 
         | We believe that access to trustworthy, factual information is
         | in the public good, which is why we keep our website open to
         | all, without a paywall.
         | 
         | If you don't want to receive personalised ads but would still
         | like to help the Guardian produce great journalism 24/7, please
         | support us today. It only takes a minute. Thank you.
        
           | mocd wrote:
           | The Guardian's ads asking for contributions have got
           | progressively more desperate. I find their commitment to
           | keeping their site paywall free admirable, but the current
           | almost-begging (and selling off their Sunday paper) has got
           | so intense that it feels like it's only a matter of time
           | until they introduce some kind of paid content.
        
             | ryandrake wrote:
             | Begging users to turn the tracking gun on themselves so
             | they can be bombarded with ads is totally pathetic, and
             | I've seen this on multiple news sites. These guys can't go
             | out of business fast enough.
        
           | duskdozer wrote:
           | >If you don't want to receive *personalised ads*
           | 
           | So ads, just not personalized. Remind me again why
           | personalized ads are good for me if I have to pay to have
           | non-personalized ads?
        
             | none2585 wrote:
             | I think their plea is: 'we make more money from
             | personalized ads so help us make up the difference through
             | donation (or whatever they're selling).'
        
       | gzread wrote:
       | This is why archive.is was created. Should we stop trying to hunt
       | down and punish its creator and support it as the extremely
       | useful project that it is?
        
         | philistine wrote:
         | The creator can maintain anonymity. The creator does not
         | deserve to continue being celebrated when they embarked on a
         | DDOS campaign using the traffic of archive.is against a
         | journalist trying to uncover their identity. By these actions,
         | they have shown to be capricious, vindictive, and willing to
         | ensnare their users in their DDOS of others. Whoever they are,
         | they're terrible.
        
           | gzread wrote:
           | Their life is in danger and one particular journalist is
           | making it so
        
           | choo-t wrote:
           | Well, if they deserve anonymity, they also deserve to be able
           | to protect it, and they have very few tools against
           | doxxing. The DDOS was one of them; corrupting the archived
           | article was another, albeit dangerous for their own
           | reputation as an archiver.
           | 
           | The crux of the problem was the doxxing, not the defense
           | against it.
        
             | ajam1507 wrote:
             | You don't think leveraging your site to DDOS someone is a
             | problem?
             | 
             | Do people not also deserve to be protected from being
             | DDOSed? Do people also not deserve to not have their
             | internet traffic be used to DDOS someone?
        
               | choo-t wrote:
               | > You don't think leveraging your site to DDOS someone is
               | a problem?
               | 
               | It is, but it's one of the only tools they have to
               | prevent the doxxing site from being reachable.
               | 
               | > Do people not also deserve to be protected from being
               | DDOSed?
               | 
               | You mean the person doing the doxxing should be protected?
               | 
               | >Do people also not deserve to not have their internet
               | traffic be used to DDOS someone?
               | 
               | Yes, it should have been opt-in. But unless you don't
               | run JS, you kinda give the right to the website you visit to
               | run arbitrary code anyway.
        
               | kpcyrd wrote:
               | You don't think non-consensually revealing somebody's
               | identity is a problem?
               | 
               | Resorting to DDoS is not pretty, but "why is my violent
               | behavior met with violence" is a little oblivious and a
               | reversal of victim and perpetrator roles.
        
               | ajam1507 wrote:
               | > You don't think non-consensually revealing somebody's
               | identity is a problem?
               | 
               | I do think it's a problem. You are the only one excusing
               | bad behavior here.
        
               | psychoslave wrote:
               | Not defending any party, it's basic ethological
               | expectation: a creature that tries to beat another should
               | expect an aggressive response in return.
               | 
               | Of course, never aggressing anyone and transforming any
               | aggression against oneself into an opportunity to
               | acculturate the aggressor into someone with the same
               | empathic behavior is the mark of a paragon of virtue. But
               | paragons of virtue are not the median norm, by definition.
        
               | ajam1507 wrote:
               | > Not defending any party, it's basic ethological
               | expectation: a creature that tries to beat another should
               | expect an aggressive response in return.
               | 
               | Another basic ethological expectation is that the strong
               | dominate the weak, but maybe we shouldn't base our moral
               | framework around how things are, and rather on how they
               | should be.
        
               | staticassertion wrote:
               | I think this is a weak framing. Lots of things are moral
               | or immoral under specific circumstances. We should
               | protect people from being murdered. I think murder is
               | usually wrong. But we also likely agree that there are
               | circumstances in which killing someone can be justified.
               | If we can find context for taking a life, I'm quite sure
               | we can find context for a DoS.
        
               | ajam1507 wrote:
               | And what's the context for using the internet traffic of
               | your unsuspecting users to accomplish this?
        
               | choo-t wrote:
               | Using the internet traffic of the people using your
               | service to protect your anonymity and thus, protecting
               | the service itself.
        
               | ajam1507 wrote:
               | So you shouldn't have to inform your users that their
               | traffic will be used in a cyberattack?
        
               | RobotToaster wrote:
               | In most jurisdictions informing them would potentially
               | make them legally liable. The fact they had no knowledge
               | shields them from liability.
        
               | ajam1507 wrote:
               | So their desire to not be used to commit a cyberattack
               | doesn't factor in? As long as they aren't legally liable,
               | it doesn't matter?
               | 
               | Also a checkbox that says something like "I would like to
               | help commit a crime using my internet traffic" would keep
               | people from having their traffic used without consent.
        
               | ryandrake wrote:
               | Unfortunately "consent" is a difficult to understand
               | concept for a lot of the web and Silicon Valley.
        
               | staticassertion wrote:
               | I don't have strong feelings about that one way or the
               | other, honestly.
        
               | RobotToaster wrote:
               | There's an old legal maxim "in pari delicto potior est
               | conditio defendentis", that is "in a case of mutual fault
               | the position of the defending party is the better one."
        
               | ajam1507 wrote:
               | That works better when there is a defendant.
        
               | mikkupikku wrote:
               | People do not ever have any sort of moral or natural
               | right to not get hit after starting shit.
        
               | ajam1507 wrote:
               | Even if this were true, this does not justify any
               | particular type of action, except maybe an in kind
               | response.
               | 
               | For example, would they have been justified to murder the
               | blogger?
        
           | Obscurity4340 wrote:
           | I had no idea that was the actual situation (journalist
           | trying to hunt them down). Sorta changes the moral calculus,
           | I'll allow it
        
           | MSFT_Edging wrote:
           | If there's ever something a journalist would never ever do,
           | it's destroy someone's life for a headline. Never ever.
           | Totally impossible.
        
           | rdevilla wrote:
           | This is great. Journalists are impeding the preservation of
           | the historical record by blocking archivist traffic while
           | simultaneously manhunting those archivists who find ways
           | around their authwalls.
           | 
           | Soon the news and the historical facts will be unnecessary.
           | You can simply receive your wisdom from the AIs, which, as
           | nondeterministic systems, are free to change the facts at
           | will.
        
             | Permit wrote:
             | >This is great. Journalists are impeding the preservation
             | of the historical record by blocking archivist traffic
             | while simultaneously manhunting those archivists who find
             | ways around their authwalls.
             | 
             | You are deliberately misrepresenting the situation. The
             | journalists who block archivist traffic are not in any way
             | connected to the blogger who was attempting to investigate
             | the creator of archive.is. You have portrayed them as
             | related in an attempt to garner sympathy for the creator of
             | archive.is.
             | 
             | Here is an account of the facts:
             | https://gyrovague.com/2026/02/01/archive-today-is-directing-...
        
               | ThoAppelsin wrote:
               | Thanks for this. I didn't know about the details, and
               | there are probably mor... but this gyrovague person is
               | clearly being a privileged trouble. Their "boringly
               | straightforward curiosity" is an admission of their
               | shallow thinking. When it is pointed out to you that you're
               | hurting someone in some respect that you weren't
               | intentional about, you should stop, sit down, and
               | reconsider everything in that respect.
               | 
               | You may end up deciding to continue inflicting harm,
               | intentionally so this time---that is a perfectly valid
               | course to take. But you _cannot_ anymore remain
               | unintentional about it.
        
               | ImPostingOnHN wrote:
               | _> When it is pointed out to you that you're hurting someone
               | in some respect that you weren't intentional about, you
               | should stop, sit down, and reconsider everything in that
               | respect._
               | 
               |  _> You may end up deciding to continue inflicting harm,
               | intentionally so this time---that is a perfectly valid
               | course to take. But you cannot anymore remain
               | unintentional about it._
               | 
               | To be clear, are you talking about the harm of commanding
               | a botnet (which includes you and me) to attack an
               | investigative journalist for investigatively journaling?
        
               | freedomben wrote:
               | Indeed. I am highly supportive of archive.is, but let's
               | remember that _he hijacked his own users to become a bot
               | net_. That should make all of us hackers furious. It's a
               | complete violation of trust. Our residential IPs were
               | used to attack someone, meaning he put us all at risk for
               | his own personal goals. It's disgusting behavior and he
               | should be called out for it. But we should also realize
               | he's offering an important and free service to us all. I
               | support him, but this is not something we should just
               | ignore. Trust is very important.
        
               | charcircuit wrote:
               | Review the definition of botnet. That is not what was
               | done.
        
               | heavyset_go wrote:
               | I didn't think I was going to side with the DDoS-er, but
               | considering what happened with Aaron Swartz, that
               | blogger was trying to get them killed or put in a box
               | forever.
        
           | staticassertion wrote:
           | They're terrible for not wanting to be dox'd?
        
             | philistine wrote:
             | They're terrible for turning all of us into parts of a
             | botnet to DDOS someone doing their job. I don't understand how
             | DDOS is the correct tool for anyone to protect their
             | anonymity.
        
         | 8cvor6j844qw_d6 wrote:
         | Agreed, and if archive.is goes down, archive.org becomes the de
         | facto monopoly in web archival.
         | 
         | That's a problem because archive.org honors removal requests
         | from site owners. Buy an old domain and you can theoretically
         | wipe its archived history clean.
        
           | charcircuit wrote:
           | Alternatively, it leaves a vacuum for an archive site that,
           | unlike archive.org, doesn't take things down, and a new one
           | takes its place as the de facto one.
        
       | tossandthrow wrote:
       | I think media outlets think way too highly of their contribution
       | to AI.
       | 
         | Had they never existed, it would likely not have made a dent in
         | AI development. Likewise, had they been twice as productive, it
         | would likely not have made a dent in the quality of LLMs
         | either.
        
         | Freak_NL wrote:
         | How do you think those models get trained? You can only get so
         | far with Wikipedia, Reddit, and non-fiction works like books
         | and academic papers.
        
           | RugnirViking wrote:
           | How does the entire textual corpus of, say, the New York Times
           | compare to all novels? Each article is a page of text, maybe
           | two at most? There certainly are an awful lot of articles.
           | But it's hard to imagine it is much more than a couple
           | hundred novels. There must be thousands of novels released
           | each year
        
             | Freak_NL wrote:
             | Like apples to oranges.
             | 
             | LLMs are (apparently) massively used to get information
             | about topics in the real world. Novels aren't going to be
             | much help there. Journalism, particularly in written form,
             | provides a fount of facts presented from different angles,
             | as well as opinions, and it was all there free for the
             | taking...
             | 
             | Wikipedia provides the scantest summary of that, fora and
             | social media give you banter, fake news, summaries of news,
             | and a whole lot of shaky opinions, at best. Novels give you
             | the foundations of language, but in terms of knowledge
             | nothing much beyond what the novel is about.
        
               | olalonde wrote:
               | LLMs can get up to date information from primary sources
               | - no journalists required.
        
               | freedomben wrote:
               | Primary sources can be, and often are, very biased.
               | Journalists are (supposed to be) doing fact checks and
               | gathering multiple sources from all sides. Modern
               | journalism is in a terrible state, but still important.
               | 
               | Imagine if all info about Facebook came from Facebook...
        
               | PopAlongKid wrote:
               | I don't understand how LLMs can ask questions at a press
               | conference.
        
               | olalonde wrote:
               | Startup idea right there.
        
               | AnthonyMouse wrote:
               | To begin with, your premise is that the only primary
               | sources are press conferences and that press conferences
               | only provide information in response to questions.
               | 
               | But even taking it literally, isn't that one of the
               | things LLMs could actually do? You're essentially asking
               | how a text generator could generate text. The real
               | question is whether the questions would be any good, but
               | the answer isn't necessarily no.
        
               | none2585 wrote:
               | I don't think an LLM can have secret human sources that
               | provide them with confidential information anonymously.
               | Not all news shows up on Twitter.
        
               | miki123211 wrote:
               | You don't need the secret human sources any more.
               | 
               | You used to need them, because journalists had the
               | distribution and the sources didn't. In a world of printed
               | newspapers, you couldn't get your story distributed
               | nationally (much less worldwide) without the help of a
               | journalist, doubly so if you wanted to stay anonymous.
               | 
               | Nowadays, you just make a Substack and there's that.
               | 
               | See that recent expose on the Delve fraud as just one
               | example. No journalists were harmed in the making of that
               | article.
        
               | ajam1507 wrote:
               | The primary source for most news is journalism.
        
               | NiloCK wrote:
               | In context, _primary source_ means the subject of the
               | article (the thing the journalist is writing about).
               | 
               | Journalism is by definition a secondary source.
               | (Notwithstanding edge cases like articles reporting
               | directly on the news industry itself.)
        
               | ajam1507 wrote:
               | Journalism is absolutely not by definition a secondary
               | source.
               | 
               | If a journalist is on location covering a flood, for
               | example, they are the primary source.
               | 
               | A journalist conducting an interview would also be a
               | primary source.
        
           | tossandthrow wrote:
           | Have a look at this article:
           | https://www.washingtonpost.com/technology/interactive/2023/a...
           | 
           | NY Times is 0.06% of Common Crawl.
           | 
           | These news media outlets provide a drop in the ocean worth of
           | information. Both qualitatively and quantitatively.
           | 
           | The news / media industry is really just trying to hold on to
           | their lifeboat before inevitably becoming entirely
           | irrelevant.
           | 
           | (I do find this sad, but it is likely the reality - I can
           | already now get considerably better journalism using LLMs
           | than actual journalists - both click bait stuff and high
           | quality stuff)
        
             | pimlottc wrote:
             | That seems like a reductive way to consider it. What
             | percent of music was created by Led Zeppelin? What percent
             | of art was painted by Monet? What percent of films by
             | Alfred Hitchcock? It may be a small percentage objectively
             | but they are hugely influential.
        
               | tossandthrow wrote:
               | I don't think back propagation cares whose text it is back
               | propagating.
        
               | NiloCK wrote:
               | The data sets aren't naively fed into the training runs.
               | 
               | Instead, training attempts to sample more heavily from
               | higher quality sources, with, I'm sure, a mix of manual
               | and heuristic labeling.
        
               | ffsm8 wrote:
               | fwiw, no llm ive ever used generated in the writing style
               | newspapers and -sites use - hence i honestly doubt
               | they've been given a meaningful boost in relevancy.
               | 
               | their idioms would leak occasionally otherwise
        
             | Gigachad wrote:
             | 90% of Common Crawl is complete junk, while the tiny bit of
             | news articles powers almost all the AI answers in Google
             | search.
        
             | datsci_est_2015 wrote:
             | How many Reddit, HN, etc. posts are based on NYT articles?
             | How many derivative news articles, blog posts, YouTube
             | videos, TikToks, etc. are responses to those articles?
             | 
             | At least NYT is probably on the correct side of Sturgeon's
             | Law: https://en.wikipedia.org/wiki/Sturgeon%27s_law
        
               | AnthonyMouse wrote:
               | > How many Reddit, HN, etc. posts are based on NYT
               | articles? How many derivative news articles, blog posts,
               | YouTube videos, TikToks, etc. are responses to those
               | articles?
               | 
               | You may get an inconvenient answer when you ask the
               | question the other way around.
        
             | Melatonic wrote:
             | 0.06% is way higher than I would expect
        
         | phatfish wrote:
         | Isn't the non-LLM generated text becoming more valuable for
         | training as the web at large is flooded with slop?
         | 
         | Preventing new human generated text from being used by AI firms
         | (without consent) seems like a valid strategy.
        
           | tossandthrow wrote:
           | No.
           | 
           | Modern LLMs are trained on a large percentage of synthetic
           | data.
           | 
           | This sentiment is largely legacy (even though just a couple
           | of years old).
        
       | Havoc wrote:
       | As someone perpetually online it's also making me rethink that a
       | bit
       | 
       | Unless you love walled gardens, doomscrolling and endless AI slop
       | that seems like the fun is over
        
       | stuaxo wrote:
       | The New York Times is awful; I want it to be archived so people
       | can see that in the future.
        
         | Archonical wrote:
         | I don't read it. Why is it awful?
        
           | lyu07282 wrote:
           | From Manufacturing Consent:
           | 
           | > by selection of topics, by distribution of concerns, by
           | emphasis and framing of issues, by filtering of information,
           | by bounding of debate within certain limits. They determine,
           | they select, they shape, they control, they restrict -- in
           | order to serve the interests of dominant, elite groups in the
           | society."
           | 
           | > "history is what appears in The New York Times archives;
           | the place where people will go to find out what happened is
           | The New York Times. Therefore it's extremely important if
           | history is going to be shaped in an appropriate way, that
           | certain things appear, certain things not appear, certain
           | questions be asked, other questions be ignored, and that
           | issues be framed in a particular fashion."
           | 
           | The propaganda in the New York Times is especially precious
           | because of how highly respected it is; there never was a war
           | or other elite interest they didn't push along.
        
           | mikkupikku wrote:
           | They have a very long track record of pretending to be
           | independent but actually toeing the government's line at key
           | pivotal moments in history when an independent newspaper is
           | needed the most. Everybody here knows how they helped start
           | the second Iraq war I hope, but that wasn't a one-off fluke.
           | Go back through the major wars in American history and you
           | can find the New York Times championing the cause of war
           | before each of these. World War 2, they uncritically accepted
           | Walter Duranty letting Stalin ghostwrite for him,
           | specifically w.r.t. Stalin's man-made famine in Ukraine,
           | because America was allied with Stalin. WWI, frequent
           | editorializing of Germans being wild Asiatic savages while
           | the Anglos were good and noble people that Americans owed
           | something to for some reason nobody could explain. Vietnam,
           | they uncritically accepted government reports on the second
           | Gulf of Tonkin incident _which never happened_ and broadly
           | accepted the government's own reports about how the war was
           | going, at least in the early years when it still might have
           | been possible to avoid further engagement. Korean war, they
           | supported the government narrative of communist containment.
           | First Iraq War, they uncritically reported very dubious
           | atrocity propaganda, like the fraudulent "Nayirah testimony"
           | given by the teenage daughter of a diplomat pretending to be
           | a politically uninvolved hospital worker.
           | 
           | The pattern here is deference to official narratives at
           | precisely the times when criticism is needed the most.
        
             | martey wrote:
             | > _World War 2, they uncritically accepted Walter Duranty
             | letting Stalin ghostwrite for him, specifically w.r.t.
             | Stalin's man-made famine in Ukraine, because America was
             | allied with Stalin._
             | 
             | Duranty's New York Times articles were written in 1931, a
             | decade before America entered World War II. They not only
             | predate an American alliance with the Soviet Union, but
             | they also predate the United States having any diplomatic
             | relations with the Soviet Union whatsoever.
             | 
             | > _Go back through the major wars in American history and
             | you can find the New York Times championing the cause of
             | war before each of these._
             | 
             | Are there other major American newspapers who have a
             | history of dissenting against war? Wasn't the New York
             | Times' behavior in most of the conflicts you mention in
             | line with American popular opinion?
        
               | mikkupikku wrote:
               | The American political apparatus was already normalizing
               | relations with the Soviet Union due to the Japanese
               | invasion of Manchuria (1931, which is when WW2 truly
               | started), due to the Great Depression in America making
               | alliance with the Soviets look economically advantageous
               | for America, and due to political instability in Germany
               | and Italy. There was a strong sense of shit hitting the
               | fan soon and that America would be with the Soviet Union
               | through it. FDR officially recognized the Soviet Union in
               | 1933, during the peak of Stalin's famine in Ukraine,
               | which the New York Times was actively denying.
               | 
               | As for other newspapers, the Times isn't worse but bears
               | the brunt of the criticism because they are after all
               | America's foremost, most influential newspaper.
        
               | martey wrote:
               | Your comment is full of historical revisionism. The
               | Second World War has little or nothing to do with the
               | Holodomor. The Times' lack of reporting on it has nothing
               | to do with American foreign policy (both Duranty and
               | Gareth Jones were British) and everything to do with
               | credulous reporters. The idea that America and the Soviet
               | Union would be natural allies was not the majority
               | viewpoint in the 1930s (outside of American communist
               | propaganda) and is clearly disproved by the Molotov-
               | Ribbentrop Pact.
        
               | lyu07282 wrote:
               | > Wasn't the New York Times' behavior in most of the
               | conflicts you mention in line with American popular
               | opinion?
               | 
               | Dear god, what? I love the unintentional satire, it's so
               | funny. "It's fine if the media lies to the people if the
               | people believe the lies." That's low even for this
               | stemlord dumpsterfire of a platform.
        
               | martey wrote:
               | > _"It's fine if the media lies to the people if the
               | people believe the lies."_
               | 
               | That is low, but that's neither a direct quote nor an
               | accurate paraphrase of my comment. While I realize that
               | the comment I replied to was edited after my response to
               | talk about lying in more recent conflicts (which might be
               | causing your confusion), I don't think you (like OP) are
               | trying to make the argument that the New York Times is
               | bad because of their reporting in the 1930s.
        
             | martey wrote:
             | It's bad etiquette to edit your comment after people have
             | replied to it without showing what your edits were. Please
             | do not do this.
        
         | gsky wrote:
         | All media opinion articles are nothing but propaganda pieces.
         | Every media outlet only allows those aligned with its ideology
         | to write those pieces.
        
       | VladVladikoff wrote:
       | As a site operator who has been battling the influx of extremely
       | aggressive AI crawlers, I'm now wondering if my tactics have
       | accidentally blocked the Internet Archive. I am totally OK with
       | them scraping my site, and they would likely obey robots.txt, but
       | these days even Facebook ignores it and exceeds my stipulated
       | crawl delay by distributing its traffic across many IPs. (I even
       | have a special nginx rule just for Facebook.)
       | 
       | Blocking certain JA3 hashes has so far been the most effective
       | countermeasure. However, I wish there was an nginx wrapper
       | around huginn-net that could help me do TCP fingerprinting as
       | well, as I do not know Rust and feel terrified of asking an LLM
       | to make it. There is also a race condition with that approach:
       | since it is passive fingerprinting, even the JA4 hashes won't be
       | available for the first connection, and the AI crawlers I've
       | seen do one request per IP, so you don't get a chance to block
       | the second request (it never happens).
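For readers unfamiliar with the crawl delay mentioned above, here is a minimal sketch of how a well-behaved crawler would read it, using Python's standard `urllib.robotparser`. The bot name `ExampleArchiveBot`, the paths, and the 10-second delay are made up for illustration. Note that `Crawl-delay` is a non-standard directive and purely advisory; nothing in this sketch lets a site enforce it against a crawler that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the bot name, paths and delay are invented.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Crawl-delay: 10
Allow: /

User-agent: *
Disallow: /private/
"""


def crawl_policy(robots_txt: str, user_agent: str, url: str):
    """Return (allowed, crawl_delay_seconds) for this user agent and URL."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_policy(ROBOTS_TXT, "ExampleArchiveBot", "https://example.com/private/x"))
    print(crawl_policy(ROBOTS_TXT, "OtherBot", "https://example.com/private/x"))
```

A polite crawler would sleep for the returned delay between requests to the same host; spreading requests across many IPs, as described above, defeats any per-IP enforcement of that delay.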
        
         | mycall wrote:
         | Evasion techniques like JA3 randomization or impersonation can
         | bypass detection.
        
           | VladVladikoff wrote:
           | I am aware, fortunately I haven't seen much of this... yet.
           | Also JA4 is supposed to be a bit less vulnerable to this.
           | Also this is why I really want TCP and HTTP fingerprinting.
           | But the best I've found so far is
           | https://github.com/biandratti/huginn-net and it is only
           | available as a Rust library; I really need it as an nginx
           | module. I've been tempted to try to vibe code an nginx module
           | that wraps this library.
        
         | andrepd wrote:
         | I wonder if it would be practical to have bot-blocking measures
         | that can be bypassed with a signature from a set of whitelisted
         | keys... In this case the server would be happy to allow
         | Internet Archive crawlers.
        
           | freedomben wrote:
           | That's an interesting idea. mTLS could probably be used for
           | this pretty easily. It would require IA to support it, of
           | course, but could be a nice solution. I wonder, do they
           | already support it? I might throw up a test...
        
         | danrl wrote:
         | > they would likely obey robots.txt
         | 
         | If only... Despite providing a useful service, they are not as
         | nice towards site owners as one would hope.
         | 
         | Internet Archive says:
         | 
         | > We see the future of web archiving relying less on robots.txt
         | file declarations geared toward search engines
         | 
         | https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
         | 
         | They are not alone in that. The "Archiveteam", a different
         | organization, not to be confused with archive.org, also doesn't
         | respect robots.txt according to their wiki:
         | https://wiki.archiveteam.org/index.php?title=Robots.txt
         | 
         | I think it is safe to say that there is little consideration
         | for site owners from the largest archiving organizations today.
         | Whether there should be is a different debate.
        
           | sunaookami wrote:
           | What an absolutely insufferable explanation from ArchiveTeam.
           | What else do you expect from an organization aggressively
           | crawling websites and bringing them down to their knees
           | because they couldn't care less?
        
             | rossng wrote:
             | I'm curious to hear about examples of where this has
             | happened. Because ArchiveTeam also has an important role in
             | rescuing cultural artefacts that have been taken into
             | private hands and then negligently destroyed.
        
               | tredre3 wrote:
               | Having a laudable goal doesn't absolve them from bad
               | behavior.
        
             | tech234a wrote:
             | That page was written by Jason Scott in 2011 and has barely
             | been changed since then.
        
             | wlonkly wrote:
             | ArchiveTeam (which is not the Internet Archive)
             | aggressively crawls websites because they care _a lot_,
             | because the website in question is about to go away.
             | 
             | Heck, I'd say as caring goes, ArchiveTeam cares more than
             | the owners of the website, because in the ideal shutdown,
             | the owners provide the data instead of forcing people to
             | scrape it if they want to retain it after the site shuts
             | down.
        
           | AnthonyMouse wrote:
           | It seems like the general problem is that the original common
           | usage of robots.txt was to identify the parts of a site that
           | would lead a recursive crawler into an infinite forest of
           | dynamically generated links, which nobody wants, but it's
           | increasingly being used to disallow the fixed content of the
           | site which is the thing they're trying to archive and which
           | shouldn't be a problem for the site when the bot is caching
           | the result so it only ever downloads it once. And more sites
           | doing the latter makes it hard for anyone to distinguish it
           | from the former, which is bad for everyone.
           | 
           | > The "Archiveteam", a different organization, not to be
           | confused with archive.org, also doesn't respect robots.txt
           | according to their wiki
           | 
           | "Archiveteam" exists in a different context. Their usual
           | purpose is to get a copy of something _quickly_ because it's
           | expected to go offline soon. This both a) makes it irrelevant
           | for ordinary sites in ordinary times and b) gives the ones
           | about to shut down an obvious thing to do, i.e. just give
           | them a better/more efficient way to make a full archive of
           | the site you're about to shut down.
        
       | b1n wrote:
       | Archive now, make public after X amount of time. So, maybe both
       | publisher and archiver are happy (or less sad).
        
       | catapart wrote:
       | I'm seeing a lot of comments about how we maintain the status
       | quo, but I'm very interested in hearing from anyone who has
       | conceded that there is no way to stop AI scrapers at this point
       | and what that means for how we maintain public information on the
       | internet in the future.
       | 
       | I don't necessarily believe that we won't find some half-
       | successful solution that will allow server hosting to be done as
       | it currently is, but I'm not very sure that I'll want to
       | participate in whatever schemes come about from it, so I'm
       | thinking more about how I can avoid those schemes rather than
       | insisting that they won't exist/work.
       | 
       | The prevailing thought is that if it's not possible now, it won't
       | be long before a human browser will be indistinguishable from an
       | LLM agent. They can start a GUI session, open a browser, navigate
       | to your page, snapshot from the OS level and backwork your
       | content from the snapshot, or use the browser dev tools or
       | whatever to scrape your page that way. And yes, that would be
       | much slower and more inefficient than what they currently do, but
       | they would only need to do that for those that keep on the
       | bleeding edge of security from AI. For everyone else, you're in a
       | security race against highly-paid interests. So the idea of
       | having something on the public internet that you can stop people
       | from archiving (for whatever purpose they want) seems like it's
       | soon to be an old-fashioned one.
       | 
       | So, taking it as a given that you can't stop what these people
       | are currently trying to stop (without a legislative solution and
       | an enforcement mechanism): how can we make scraping less of a
       | burden on individual hosts? Is this thing going to coalesce into
       | centralizing "archiving" authorities that people trust to archive
       | things, and serve as a much more structured and friendly way for
       | LLMs to scrape? Or is it more likely someone will come up with a
       | way to punish LLMs or their hosts for "bad" behavior? Or am I
       | completely off base? Is anyone actually discussing this? And, if
       | so, what's on the table?
        
         | titzer wrote:
         | You're going to hate this, but one answer might be blockchain.
         | A cryptographically strong, attestable public record of
         | appending information to a shared repository. Combined with
         | cryptographic signatures for humans, it's basically a secure,
         | open git repository for human knowledge.
        
           | catapart wrote:
           | Sounds interesting, but I guess I'm a little unsure of how to
           | connect the dots? Are you suggesting that websites would be
           | hosted on a blockchain and browsed by human-signed browsers?
           | Or more like there would be a blockchain authority, which
           | server hosts could query to determine if a signature,
           | provided by their browser, is human? Would you mind painting
           | the picture in a little more detail?
        
           | echelon wrote:
           | We're rarely going to need to attest anything is "real" or
           | "human". It's basically only going to matter in civil and
           | criminal court, and IDV.
           | 
           | We don't need to attest signals are analogue vs. digital. The
           | world is going to adapt to the use of Gen AI in everything.
           | The future of art, communications, and productivity will all
           | be rooted in these tools.
        
           | sharperguy wrote:
           | You can have cryptographically signed data caches without the
           | need for a blockchain. What a blockchain can add is the
           | ability to say that a particular piece of data must have
           | existed before a given date, by including the hash of that
           | data somewhere in the chain.
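The "must have existed before a given date" property can be sketched with a toy hash chain. This is an illustration under stated assumptions, not a real blockchain: there is no consensus, no proof of work, and the timestamps are simply trusted. All function names and the block layout are hypothetical.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block deterministically (keys sorted so the encoding is stable)."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: list, data_hash: str, timestamp: int) -> None:
    """Append a block that commits to a data hash and to the previous block."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev": prev, "data_hash": data_hash, "time": timestamp})


def verify(chain: list) -> bool:
    """Check that every block still points at the hash of its predecessor."""
    return all(
        chain[i]["prev"] == block_hash(chain[i - 1]) for i in range(1, len(chain))
    )


def existed_by(chain: list, data: bytes):
    """Return the timestamp of the first block committing to this data, if any."""
    digest = hashlib.sha256(data).hexdigest()
    for block in chain:
        if block["data_hash"] == digest:
            return block["time"]
    return None


if __name__ == "__main__":
    chain = []
    page = b"<html>archived page</html>"
    append_block(chain, hashlib.sha256(page).hexdigest(), 1_700_000_000)
    append_block(chain, hashlib.sha256(b"something later").hexdigest(), 1_700_000_600)
    print(verify(chain), existed_by(chain, page))
```

Only the hash goes into the chain, so the data itself can live in an ordinary signed cache. Rewriting an old block changes its hash and breaks the `prev` link of every later block, which is what makes the recorded date hard to backdate.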
        
           | techjamie wrote:
           | > Combined with cryptographic signatures for humans
           | 
           | What happens when the human gives an agent access to said
           | signature? Then you fall back on traditional anti-bot
           | techniques and you're right back where you started.
        
             | jakeydus wrote:
             | DNA/biometrics are the only secure future!
             | 
             | I joke, but there are those out there who don't.
        
           | amarant wrote:
           | You'd spend less compute just serving the crawlers than
           | maintaining the blockchain.
           | 
           | Like, 3 orders of magnitude less compute, conservatively
           | counting.
        
         | heavyset_go wrote:
         | If you don't publish content to the public web anymore, you
         | don't have to worry about traffic or scraping or bots.
         | 
         | Maybe it'll just be cheaper for CDNs or whatever to sell the
         | data they serve directly instead of doing extra steps with
         | scraping
        
           | miki123211 wrote:
           | The only answer is WebDRM.
           | 
           | It's easy to pretend you're human, it's hard to pretend that
           | you have a valid cryptographic signature for Google which
           | attests that your hardware is Google-approved.
           | 
           | Crawling is the price we pay for the web's openness.
        
             | realusername wrote:
             | It's not hard to bypass attestation. It's actually very
             | easy and done right now at scale: there are giant click
             | farms with phones on racks.
             | 
             | They don't modify any device and will pass whatever
             | attestation you try to make.
        
           | eikenberry wrote:
           | I think this is what will happen. That the public internet
           | will become the place you go to seed the data you want to the
           | scrapers and you will use a private internet for everything
           | else. Private sites, private feeds, mesh networks, etc. We're
           | basically going back in time similar to when AOL and friends
           | had their own private networks for their members.
        
         | suzzer99 wrote:
         | I don't see this as a permanent problem. Right now there must
         | be 1000s of well-funded AI companies trying to scrape the
         | entire internet. Eventually the AI equity bubble will pop and
         | there will be consolidation. If every player left has already
         | scanned the web, will they need to keep constantly scanning it?
         | Seems like no. Even if they do, there will be a lot less of
         | them.
        
           | kdheiwns wrote:
           | The current trend is that it's getting cheaper and easier to
           | roll out your own AI on your own computer, so more and more
           | people will do it as a hobby. Even if the big players die
           | out, some dude with a decent gaming PC could decide to start
           | scraping everything pertaining to their interests just for
           | the hell of it. Every government with a budget and someone
           | capable of doing the job will surely get in on it as well.
        
             | overfeed wrote:
             | > some dude with a decent gaming PC could decide to start
             | scraping everything pertaining to their interests just for
             | the hell of it.
             | 
             | Not from their single residential IP, they are not.
             | 
             | If they do succeed[1] - it is not going to be at hundreds
             | or thousands of requests per second that the current AI
             | scrapers bombard servers with. Some dude at home will, at
             | best, be putting 4-6 orders of magnitude less strain on a
             | limited set of servers.
             | 
             | 1. Scraping is an arms race: if you're just "some dude" at
             | the skill floor - you're going to have a bad time whether
             | you're scraping, or defending against scrapers.
        
         | ronsor wrote:
         | > without a legislative solution and an enforcement mechanism
         | 
         | If there's one thing people, especially HN users, should've
         | learned by now, it's that there's no enforcement mechanism
         | worth a damn for Internet legislation when incentives don't
         | align.
        
         | AnthonyMouse wrote:
         | > how can we make scraping less of a burden on individual
         | hosts?
         | 
         | Isn't this basically what content-addressable storage is for?
         | Have the site provide the content hashes rather than the
         | content and then put the content on IPFS/BitTorrent/whatever
         | where the bots can get it from each other instead of bothering
         | the site.
         | 
         | Extra points if you can get popular browsers to implement
         | support for this, since it also makes it a lot harder to censor
         | things and a decent implementation (i.e. one that prefers
         | closer sources/caches) would give most of the internet the
         | efficiency benefits of a CDN without the centralization.
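A rough sketch of the content-addressing idea above: the site publishes only a hash, and a client accepts the content from any peer as long as it matches that hash. The `ContentStore` class and `fetch_verified` function are hypothetical stand-ins for a real network such as IPFS or BitTorrent, not their actual APIs.

```python
import hashlib


class ContentStore:
    """Toy in-memory peer that stores blobs keyed by their SHA-256 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self._blobs[digest] = content
        return digest

    def get(self, digest: str):
        return self._blobs.get(digest)


def fetch_verified(digest: str, peers):
    """Try each peer in order; accept only content whose hash matches."""
    for peer in peers:
        blob = peer.get(digest)
        if blob is not None and hashlib.sha256(blob).hexdigest() == digest:
            return blob
    return None


if __name__ == "__main__":
    origin, mirror = ContentStore(), ContentStore()
    article = b"full text of an article"
    digest = origin.put(article)   # the site publishes only this digest
    mirror.put(article)            # any peer that already has a copy can serve it
    print(fetch_verified(digest, [mirror]) == article)
```

Because the digest identifies the content, the client does not need to trust the peer that serves it, which is what lets bots fetch from each other instead of from the origin site.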
        
         | zer00eyz wrote:
         | > anyone who has conceded that there is no way to stop AI
         | scrapers at this point and what that means for how we maintain
         | public information on the internet in the future.
         | 
         | Bloat and bandwidth costs are the real problems here. Everyone
         | seems to have forgotten the basics of engineering and
         | accounting.
        
       | rkwtr1299 wrote:
       | The EFF has a lukewarm stance on AI, but criticizes everyone
       | else. AI is clearly ruining the Internet and the job market.
       | 
       | How about thinking about your mission and taking an anti-AI
       | hardliner stance? But I see multiple corporate sponsors that
       | would not be pleased:
       | 
       | https://www.eff.org/thanks
       | 
       | All these so-called freedom organizations like the OSI and the
       | EFF have been bought and are entirely irrelevant if not harmful.
        
       | rdiddly wrote:
       | When you disappear from the historical record, that's called you
       | becoming irrelevant. The world moves on, and pays attention to
       | someone else. Not sure why the Times doesn't seem to see this
       | angle.
        
       | lich_king wrote:
       | I am really tired of this kind of moralizing. The reality is that
       | every time geeks come up with some utopian ideal, such as that we
       | should publish all our software under free licenses or make all
       | human knowledge freely accessible to anyone, _the same geeks
       | later show up and build extractive industries on top of this_. Be
       | a part of the open source revolution... so that you do unpaid
       | labor for Facebook. Make a quirky homepage... so that we can
       | bootstrap global-scale face recognition tech. Help us build the
       | modern-day library of Alexandria... so that OpenAI and Anthropic
       | can sell it back to you in a convenient squeezable tube.
       | 
       | Maybe it's time to admit that the techie community has a pretty
       | bad moral compass and that we're not good stewards of the world's
       | knowledge. We turn lofty ideals into amoral money-making schemes
       | whenever we can. I'm not sure that the EFF's role in this is all
       | that positive. They come from a good place, but they ultimately
       | aid a morally bankrupt industry. I don't want archive.org to
       | retain a copy of everyone's online footprint because I know it
       | will be used the same way it always is: to make money off other
       | people's labor and to erode privacy.
        
         | Peritract wrote:
         | Agreed; again and again, we see that the utopian ideals of the
         | tech world are only the ones that let them extract value
         | without consideration.
        
       | alexpotato wrote:
       | As someone who did a lot of work on early spam fighting only to
       | see it replaced by things like DKIM, I wonder if we are going to
       | start having the "taxi medallion" style approach but for people
       | connecting to your site.
       | 
       | e.g. IA will send out signed HTTPS requests with their key so
       | you, as the site owner, can confirm that it is indeed from them
       | and not from AI.
       | 
       | Feels like that would be very anti open internet but not sure how
       | else you would prove who is a good actor vs not (from your
       | perspective that is).
        
         | m3047 wrote:
         | I'll tell you what I expect to see from crawlers and agents,
         | and what I'm enforcing on everybody who doesn't look distinctly
         | human:
         | 
         | * Reverse DNS which points to a web site which has a
         | discoverable / well-known page which clearly describes their
         | behavior.
         | 
         | * Some sort of reverse-IP-based, RBL- and SPF-inspired TXT
         | records which describe who, what, when, why, how, how often
         | 
         | so that I can make automated decisions based on it.
         | 
         | Yah, I don't have a lot of crawlers that I welcome... but I'm
         | building a pretty good database of the worst offenders. At
         | scale... there are advantages to scale which work in my favor,
         | actually.
         | 
         | I documented this at the end of a blog post when I made
         | blocking Amazon incoming requests a default policy several
         | years ago.
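The reverse DNS expectation above is only useful if the PTR record is confirmed in the forward direction, since whoever controls an IP range can make its PTR claim any hostname. Below is a minimal sketch of that forward-confirmed check, not the commenter's actual setup. The suffix `archive.example` and the function names are hypothetical, and the resolvers are injectable so the logic can be exercised without network access.

```python
import socket


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> set:
    """Forward lookup: hostname -> set of IP address strings."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_forward_confirmed(ip, allowed_suffixes,
                         reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """True if the PTR name is under an allowed domain and resolves back to ip."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    if not any(hostname == s or hostname.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False


if __name__ == "__main__":
    # Fake resolvers standing in for real DNS; addresses are documentation ranges.
    ok = is_forward_confirmed(
        "203.0.113.7", ["archive.example"],
        reverse=lambda ip: "crawler-1.archive.example",
        forward=lambda host: {"203.0.113.7"},
    )
    print(ok)
```

The TXT-record policy described in the second bullet is not sketched here because no format for it is given.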
        
       | ashwinnair99 wrote:
       | We're essentially burning the library to punish the arsonist. The
       | arsonist already left.
        
         | tremon wrote:
         | What do you mean, "the arsonist already left"? Isn't it more
         | accurate to say that 90% of the library's visitors are
         | arsonists?
        
           | pamcake wrote:
           | It is not accurate. A very small number of actors pose as
           | many and make up the majority of traffic. For example, your
           | User-Agent block may cut traffic by 10%, 99% of which is
           | malicious, but you blocked 1000 individuals, only 1 of whom
           | is malicious.
        
       | phendrenad2 wrote:
       | Does IA use a known set of IPs? Should be trivial to let them
       | through. But yeah, news companies aren't technically capable of
       | this kind of finesse, they probably have by-the-hour contractors
       | doing any coding/config changes, and closing the ticket is the
       | goal there.
        
       | neilv wrote:
       | I'm now an AI bro, and a long-time fan of the EFF (though they
       | occasionally make a mistake).
       | 
       | I think this EFF piece could be more forthright (rather than
       | political persuasion), since the matter involves balancing
       | multiple public interest goals that are currently in opposition.
       | 
       | > _Organizations like the Internet Archive are not building
       | commercial AI systems._
       | 
       | This NiemanLab article lists evidence that Internet Archive
       | explicitly encouraged crawling of their data, which was used for
       | training major commercial AI models:
       | 
       | | News publishers limit Internet Archive access due to AI
       | scraping concerns (niemanlab.org) | 569 points by ninjagoo 34
       | days ago | 366 comments |
       | https://news.ycombinator.com/item?id=47017138
       | 
       | > _[...] over a fight that libraries like the Archive didn't
       | start, and didn't ask for._
       | 
       | They started or stumbled into this fight through their actions.
       | And (ideology?) they also started and asked for a related fight,
       | about disregard of copyright and exploitation of creators:
       | 
       | | Internet Archive forced to remove 500k books after publishers'
       | court win (arstechnica.com) | 530 points by cratermoon on June
       | 21, 2024 | 564 comments |
       | https://news.ycombinator.com/item?id=40754229
        
       | charcircuit wrote:
       | The EFF is being obtuse. Using archive sites is a known bypass
       | for reading news articles for free. Every time a paywalled site
       | is posted, someone posts an archive link so others can read for
       | free.
       | 
       | >Archiving and Search Are Legal
       | 
       | But giving full articles away for free to everyone is not.
       | Archive.org has the power to make archives private.
        
       | m3047 wrote:
       | If you're selling ammonium nitrate and diesel, it's a reasonable
       | presumption that you're in the agricultural supply business. It's
       | also reasonable to expect you not to sell a truckload of both to
       | someone who you don't know to be a farmer.
        
       ___________________________________________________________________
       (page generated 2026-03-21 23:00 UTC)