[HN Gopher] Blocking Internet Archive Won't Stop AI, but Will Er...
___________________________________________________________________
Blocking Internet Archive Won't Stop AI, but Will Erase Web's
Historical Record
Author : pabs3
Score : 471 points
Date : 2026-03-21 07:30 UTC (15 hours ago)
(HTM) web link (www.eff.org)
(TXT) w3m dump (www.eff.org)
| xnx wrote:
| Does Internet Archive have a distributed residential IP crawler
| program? I would enthusiastically contribute to that.
|
| There must be some mechanism to prevent tampering in such a
| setup.
| progval wrote:
| The Internet Archive does not, but Archive Team does:
| https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
| xnx wrote:
| Yes! I'm running an instance right now.
| Retr0id wrote:
| > There must be some mechanism to prevent tampering in such a
| setup.
|
| Trivial as long as they terminate the TLS on their end, not
| yours. So you'd just be a residential proxy.
| gzread wrote:
| No, IA does everything above board and even honors invalid DMCA
| takedowns.
| SlinkyOnStairs wrote:
| Devil's advocate: Anyone seeking to limit AI scraping doesn't
| have much of a choice in also blocking archivists.
|
| And it's genuinely not that weird for news organisations to want
| to stop AI scraping. This is just a repeat of their fight with
| social media embedding.
|
| Sure. The _back catalogue_ should be as close to public domain as
| possible, libraries keeping those records is incredibly important
| for research.
|
| But with current news, that becomes complicated as taking the
| articles and not paying the subscription (or viewing their ads)
| directly takes away the revenue streams that newsrooms rely on to
| produce the news. Hence the "Newspaper trying to ban linking"
| mess, which was never about the links themselves but about social
| media sites embedding the headline and a snippet, which in turn
| made all the users stop clicking through and "paying" for the
| article.
|
| Social media relies on those newsrooms (same with really, most
| other kinds of websites) to provide a lot of their content. And
| AI relies on them for all of the training data (remember:
| "Synthetic data" does not appear ex nihilo) & to provide the news
| that the AI users request. We can't just let the newsrooms die.
| The newsroom hasn't been replaced itself, its revenue has been
| destroyed.
|
| ---
|
| And so, the question of archives pops up. Because yes, you can
| with some difficulty block out the AI bots, even the social media
| bots. A paywall suffices.
|
| But this kills archiving. Yet if you whitelist the archives in
| some way, the AI scrapers will just pull their data out of the
| archive instead and the newsrooms still die. (Which also makes
| the archiving moot)
|
| A compromise solution might be for archives to accept/publish
| things on a delay, keep the AI companies from taking the current
| news without paying up, but still granting everyone access to
| stuff from decades ago.
|
| There's just major disagreement about what a reasonable delay is.
| Most major news orgs and other such IP-holders are pretty upset
| about AI firms' "steal first, ask permission later" approach.
| Several AI firms setting the standard that training data is to be
| paid for doesn't help here either. In paying for training data
| they've created a significant market for archives, and
| significant incentive to not make them publicly freely
| accessible.
|
| Why would The Times _ever_ hand over their catalogue to the
| Internet Archive if Amazon will pay them a significant sum of
| money for it? The greater good of all humanity? Good luck getting
| that from a dying industry.
|
| ---
|
| Tangent: Another annoying wrinkle in the financial incentives
| here is that not all archiving organisations are engaging in fair
| play, which yet further pushes people to obstruct their work.
|
| To cite a HN-relevant example: Source code archivist "Software
| Heritage" has long engaged in holding a copy of all the
| sourcecode they can get their hands on, regardless of its
| license. If it's ever been on GitHub, odds are they're
| distributing it. Even when licenses explicitly forbid that. (This
| is, of course, perfectly legal in the case of actual research and
| other fair use. But:)
|
| They were notably involved in HuggingFace's "The Stack" project
| by sharing their archives ... and received money from
| HuggingFace. While the latter is nominally a donation, this is in
| effect a sale.
|
| ---
|
| I find it quite displeasing that the EFF fails to identify the
| incentives at play here. Simply trying to nag everyone into
| "doing the thing for the greater good!" is loathsome and doesn't
| work. Unless we change this incentive structure, the outcome
| won't change.
| Obscurity4340 wrote:
| It would be better if there was some arrangement the papers
| could reach with Archive where they just delay the release or
| wait a week then it's part of the archive. That way, news stuff
| gets paid for when it's hot and fresh but then it gets archived
| and the record is preserved
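
The delayed-release arrangement suggested above can be sketched as a simple embargo check. This is only an illustration under stated assumptions: the function name `publicly_viewable` and the seven-day default are hypothetical (the default is taken from the "wait a week" suggestion), and nothing here reflects how the Internet Archive actually works.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical default, taken from the "wait a week" suggestion in the thread.
DEFAULT_EMBARGO = timedelta(days=7)


def publicly_viewable(captured_at: datetime, now: datetime,
                      embargo: timedelta = DEFAULT_EMBARGO) -> bool:
    """Return True once a capture is at least `embargo` old.

    The capture is stored immediately; only public access is delayed.
    """
    return now - captured_at >= embargo


if __name__ == "__main__":
    captured = datetime(2026, 3, 1, tzinfo=timezone.utc)
    print(publicly_viewable(captured, datetime(2026, 3, 5, tzinfo=timezone.utc)))
    print(publicly_viewable(captured, datetime(2026, 3, 9, tzinfo=timezone.utc)))
```

The disagreement described in the thread is about the value of `embargo`, not the mechanism: a week and several decades are both just a different `timedelta`.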
| user_7832 wrote:
| > But in recent months The New York Times began blocking the
| Archive from crawling its website, using technical measures that
| go beyond the web's traditional robots.txt rules. That risks
| cutting off a record that historians and journalists have relied
| on for decades. Other newspapers, including The Guardian, seem to
| be following suit.
|
| I'm a bit surprised I never read about this till now, though
| while disappointing it is unfortunately not surprising.
|
| > The Times says the move is driven by concerns about AI
| companies scraping news content. Publishers seek control over how
| their work is used, and several--including the Times--are now
| suing AI companies over whether training models on copyrighted
| material violates the law. There's a strong case that such
| training is fair use.
|
| I suspect part of it might be these corps not wanting people to
| skip a paywall (whether or not someone would pay even if they had
| no access is a different story). But this argument makes no sense
| for the Guardian.
| user_7832 wrote:
| I went to Guardian's website to cross check their motto
| (getting confused with WaPo's motto) and got served this
| (hilarious? sad?) banner. As if blocking cross website tracking
| is somehow bad.
|
| > Rejection hurts ... You've chosen to reject third-party
| cookies while browsing our site. Not being able to use third
| party cookies means we make less from selling adverts to fund
| our journalism.
|
| We believe that access to trustworthy, factual information is
| in the public good, which is why we keep our website open to
| all, without a paywall.
|
| If you don't want to receive personalised ads but would still
| like to help the Guardian produce great journalism 24/7, please
| support us today. It only takes a minute. Thank you.
| mocd wrote:
| The Guardian's ads asking for contributions have got
| progressively more desperate. I find their commitment to
| keeping their site paywall free admirable, but the current
| almost-begging (and selling off their Sunday paper) has got
| so intense that it feels like it's only a matter of time
| until they introduce some kind of paid content.
| ryandrake wrote:
| Begging users to turn the tracking gun on themselves so
| they can be bombarded with ads is totally pathetic, and
| I've seen this on multiple news sites. These guys can't go
| out of business fast enough.
| duskdozer wrote:
| >If you don't want to receive *personalised ads*
|
| So ads, just not personalized. Remind me again why
| personalized ads are good for me if I have to pay to have
| non-personalized ads?
| none2585 wrote:
| I think their plea is: 'we make more money from
| personalized ads so help us make up the difference through
| donation (or whatever they're selling).'
| gzread wrote:
| This is why archive.is was created. Should we stop trying to hunt
| down and punish its creator and support it as the extremely
| useful project that it is?
| philistine wrote:
| The creator can maintain anonymity. The creator does not
| deserve to continue being celebrated when they embarked on a
| DDOS campaign using the traffic of archive.is against a
| journalist trying to uncover their identity. By these actions,
| they have shown to be capricious, vindictive, and willing to
| ensnare their users in their DDOS of others. Whoever they are,
| they're terrible.
| gzread wrote:
| Their life is in danger and one particular journalist is
| making it so
| choo-t wrote:
| Well, if they deserve anonymity, they also deserve to be able
| to protect it, and they have really few tools against a
| doxxing, the DDOS was one of them, corrupting the archived
| article was another, albeit dangerous for their own
| reputation as an archiver.
|
| The crux of the problem was the doxxing, not the defense
| against it.
| ajam1507 wrote:
| You don't think leveraging your site to DDOS someone is a
| problem?
|
| Do people not also deserve to be protected from being
| DDOSed? Do people also not deserve to not have their
| internet traffic be used to DDOS someone?
| choo-t wrote:
| > You don't think leveraging your site to DDOS someone is
| a problem?
|
| It is, but it's one of the only tools they have to
| prevent the doxxing site from being reachable.
|
| > Do people not also deserve to be protected from being
| DDOSed?
|
| You mean the person doing the doxxing should be protected?
|
| >Do people also not deserve to not have their internet
| traffic be used to DDOS someone?
|
| Yes, it should have been opt-in. But unless you don't
| run JS, you kinda give the websites you visit the right to
| run arbitrary code anyway.
| kpcyrd wrote:
| You don't think non-consensually revealing somebody's
| identity is a problem?
|
| Resorting to DDoS is not pretty, but "why is my violent
| behavior met with violence" is a little oblivious and
| reversal of victim and perpetrator roles.
| ajam1507 wrote:
| > You don't think non-consensually revealing somebody's
| identity is a problem?
|
| I do think it's a problem. You are the only one excusing
| bad behavior here.
| psychoslave wrote:
| Not defending any party, it's a basic ethological
| expectation: a creature that tries to beat another should
| expect an aggressive response in return.
|
| Of course, never attacking anyone and transforming any
| aggression against oneself into an opportunity to
| acculturate the aggressor into someone with the same
| empathic behavior is the mark of a paragon of virtue. But
| paragons of virtue are not the median norm, by definition.
| ajam1507 wrote:
| > Not defending any party, it's a basic ethological
| expectation: a creature that tries to beat another should
| expect an aggressive response in return.
|
| Another basic ethological expectation is that the strong
| dominate the weak, but maybe we shouldn't base our moral
| framework around how things are, and rather on how they
| should be.
| staticassertion wrote:
| I think this is a weak framing. Lots of things are moral
| or immoral under specific circumstances. We should
| protect people from being murdered. I think murder is
| usually wrong. But we also likely agree that there are
| circumstances in which killing someone can be justified.
| If we can find context for taking a life, I'm quite sure
| we can find context for a DoS.
| ajam1507 wrote:
| And what's the context for using the internet traffic of
| your unsuspecting users to accomplish this?
| choo-t wrote:
| Using the internet trafic of the persons using your
| service to protect your anonymity and thus, protecting
| the service itself.
| ajam1507 wrote:
| So you shouldn't have to inform your users that their
| traffic will be used in a cyberattack?
| RobotToaster wrote:
| In most jurisdictions informing them would potentially
| make them legally liable. The fact they had no knowledge
| shields them from liability.
| ajam1507 wrote:
| So their desire to not be used to commit a cyberattack
| doesn't factor in? As long as they aren't legally liable,
| it doesn't matter?
|
| Also a checkbox that says something like "I would like to
| help commit a crime using my internet traffic" would keep
| people from having their traffic used without consent.
| ryandrake wrote:
| Unfortunately "consent" is a difficult to understand
| concept for a lot of the web and Silicon Valley.
| staticassertion wrote:
| I don't have strong feelings about that one way or the
| other, honestly.
| RobotToaster wrote:
| There's an old legal maxim "in pari delicto potior est
| conditio defendentis", that is "in a case of mutual fault
| the position of the defending party is the better one."
| ajam1507 wrote:
| That works better when there is a defendant.
| mikkupikku wrote:
| People do not ever have any sort of moral or natural
| right to not get hit after starting shit.
| ajam1507 wrote:
| Even if this were true, this does not justify any
| particular type of action, except maybe an in kind
| response.
|
| For example, would they have been justified to murder the
| blogger?
| Obscurity4340 wrote:
| I had no idea that was the actual situation (journalist
| trying to hunt them down). Sorta changes the moral calculus,
| I'll allow it
| MSFT_Edging wrote:
| If there's ever something a journalist would never ever do,
| it's destroy someone's life for a headline. Never ever.
| Totally impossible.
| rdevilla wrote:
| This is great. Journalists are impeding the preservation of
| the historical record by blocking archivist traffic while
| simultaneously manhunting those archivists who find ways
| around their authwalls.
|
| Soon the news and the historical facts will be unnecessary.
| You can simply receive your wisdom from the AIs, which, as
| nondeterministic systems, are free to change the facts at
| will.
| Permit wrote:
| >This is great. Journalists are impeding the preservation
| of the historical record by blocking archivist traffic
| while simultaneously manhunting those archivists who find
| ways around their authwalls.
|
| You are deliberately misrepresenting the situation. The
| journalists who block archivist traffic are not in any way
| connected to the blogger who was attempting to investigate
| the creator of archive.is. You have portrayed them as
| related in an attempt to garner sympathy for the creator of
| archive.is.
|
| Here is an account of the facts:
| https://gyrovague.com/2026/02/01/archive-today-is-directing-...
| ThoAppelsin wrote:
| Thanks for this. I didn't know about the details, and
| there are probably mor... but this gyrovague person is
| clearly being a privileged trouble. Their "boringly
| straightforward curiosity" is an admission of their
| shallow thinking. When you are pointed out that you're
| hurting someone in some respect that you weren't
| intentional about, you should stop, sit down, and
| reconsider everything in that respect.
|
| You may end up deciding to continue inflicting harm,
| intentionally so this time---that is a perfectly valid
| course to take. But you _cannot_ anymore remain
| unintentional about it.
| ImPostingOnHN wrote:
| _> When you are pointed out that you're hurting someone
| in some respect that you weren't intentional about, you
| should stop, sit down, and reconsider everything in that
| respect._
|
| _> You may end up deciding to continue inflicting harm,
| intentionally so this time---that is a perfectly valid
| course to take. But you cannot anymore remain
| unintentional about it._
|
| To be clear, are you talking about the harm of commanding
| a botnet (which includes you and me) to attack an
| investigative journalist for investigatively journaling?
| freedomben wrote:
| Indeed. I am highly supportive of archive.is, but let's
| remember that _he hijacked his own users to become a bot
| net_. That should make all us hackers furious. It's a
| complete violation of trust. Our residential IPs were
| used to attack someone, meaning he put us all at risk for
| his own personal goals. It's disgusting behavior and he
| should be called out for it. But we should also realize
| he's offering an important and free service to us all. I
| support him, but this is not something we should just
| ignore. Trust is very important.
| charcircuit wrote:
| Review the definition of botnet. That is not what was
| done.
| heavyset_go wrote:
| I didn't think I was going to side with the DDoS-er, but
| considering what happened with Aaron Swartz, that
| blogger was trying to get them killed or put in a box
| forever.
| staticassertion wrote:
| They're terrible for not wanting to be dox'd?
| philistine wrote:
| They're terrible for turning all of us into parts of a
| botnet to DDOS someone doing their job. I don't understand how
| DDOS is the correct tool for anyone to protect their
| anonymity.
| 8cvor6j844qw_d6 wrote:
| Agreed, and if archive.is goes down, archive.org becomes the de
| facto monopoly in web archival.
|
| That's a problem because archive.org honors removal requests
| from site owners. Buy an old domain and you can theoretically
| wipe its archived history clean.
| charcircuit wrote:
| Alternatively, it leaves a vacuum for an archive site that,
| unlike archive.org, doesn't take things down, and a new
| one takes its place as the de facto one.
| tossandthrow wrote:
| I think media outlets think way too highly of their contribution
| to AI.
|
| Had they never existed, it would likely not have made a dent in
| AI development - just as, had they been twice as productive, it
| would likely not have made a dent in the quality of LLMs either.
| Freak_NL wrote:
| How do you think those models get trained? You can only get so
| far with Wikipedia, Reddit, and non-fiction works like books
| and academic papers.
| RugnirViking wrote:
| How does the entire textual corpus of, say, the New York Times
| compare to all novels? Each article is a page of text, maybe
| two at most? There certainly are an awful lot of articles.
| But it's hard to imagine it is much more than a couple
| hundred novels. There must be thousands of novels released
| each year
| Freak_NL wrote:
| Like apples to oranges.
|
| LLMs are (apparently) massively used to get information
| about topics in the real world. Novels aren't going to be
| much help there. Journalism, particularly in written form,
| provides a fount of facts presented from different angles,
| as well as opinions, and it was all there free for the
| taking...
|
| Wikipedia provides the scantest summary of that, fora and
| social media give you banter, fake news, summaries of news,
| and a whole lot of shaky opinions, at best. Novels give you
| the foundations of language, but in terms of knowledge
| nothing much beyond what the novel is about.
| olalonde wrote:
| LLMs can get up to date information from primary sources
| - no journalists required.
| freedomben wrote:
| Primary sources can be, and often are, very biased.
| Journalists are (supposed to be) doing fact checks and
| gathering multiple sources from all sides. Modern
| journalism is in a terrible state, but still important.
|
| Imagine if all info about Facebook came from Facebook...
| PopAlongKid wrote:
| I don't understand how LLMs can ask questions at a press
| conference.
| olalonde wrote:
| Startup idea right there.
| AnthonyMouse wrote:
| To begin with, your premise is that the only primary
| sources are press conferences and that press conferences
| only provide information in response to questions.
|
| But even taking it literally, isn't that one of the
| things LLMs could actually do? You're essentially asking
| how a text generator could generate text. The real
| question is whether the questions would be any good, but
| the answer isn't necessarily no.
| none2585 wrote:
| I don't think an LLM can have secret human sources that
| provide them with confidential information anonymously.
| Not all news shows up on Twitter.
| miki123211 wrote:
| You don't need the secret human sources any more.
|
| You used to need them, because journalists had the
| distribution and the sources didn't. In a world of printed
| newspapers, you couldn't get your story distributed
| nationally (much less worldwide) without the help of a
| journalist, doubly so if you wanted to stay anonymous.
|
| Nowadays, you just make a Substack and there's that.
|
| See that recent expose on the Delve fraud as just one
| example. No journalists were harmed in the making of that
| article.
| ajam1507 wrote:
| The primary source for most news is journalism.
| NiloCK wrote:
| In context, _primary source_ means the subject of the
| article (the thing the journalist is writing about).
|
| Journalism is by definition a secondary source.
| (Notwithstanding edge cases like articles reporting
| directly on the news industry itself.)
| ajam1507 wrote:
| Journalism is absolutely not by definition a secondary
| source.
|
| If a journalist is on location covering a flood, for
| example, they are the primary source.
|
| A journalist conducting an interview would also be a
| primary source.
| tossandthrow wrote:
| Have a look at this article:
| https://www.washingtonpost.com/technology/interactive/2023/a...
|
| NY Times is 0.06% of common crawl.
|
| These news media outlets provide a drop in the ocean worth of
| information. Both qualitatively and quantitatively.
|
| The news / media industry is really just trying to hold on to
| their lifeboat before inevitably becoming entirely
| irrelevant.
|
| (I do find this sad, but it is the reality - I can
| already now get considerably better journalism using LLMs
| than actual journalists - both click bait stuff and high
| quality stuff)
| pimlottc wrote:
| That seems like a reductive way to consider it. What
| percent of music was created by Led Zeppelin? What percent
| of art was painted by Monet? What percent of films by
| Alfred Hitchcock? It may be a small percentage objectively
| but they are hugely influential.
| tossandthrow wrote:
| I don't think back propagation cares whose text it is back
| propagating.
| NiloCK wrote:
| The data sets aren't naively fed into the training runs.
|
| Instead, training attempts to sample more heavily from
| higher quality sources, with, I'm sure, a mix of manual
| and heuristic labeling.
| ffsm8 wrote:
| fwiw, no llm ive ever used generated in the writing style
| newspapers and -sites use - hence i honestly doubt
| they've been given a meaningful boost in relevancy.
|
| their idioms would leak occasionally otherwise
| Gigachad wrote:
| 90% of common crawl is complete junk. While the tiny bit of
| news articles powers almost all the AI answers in Google
| search.
| datsci_est_2015 wrote:
| How many Reddit, HN, etc. posts are based on NYT articles?
| How many derivative news articles, blog posts, YouTube
| videos, TikToks, etc. are responses to those articles?
|
| At least NYT is probably on the correct side of Sturgeon's
| Law: https://en.wikipedia.org/wiki/Sturgeon%27s_law
| AnthonyMouse wrote:
| > How many Reddit, HN, etc. posts are based on NYT
| articles? How many derivative news articles, blog posts,
| YouTube videos, TikToks, etc. are responses to those
| articles?
|
| You may get an inconvenient answer when you ask the
| question the other way around.
| Melatonic wrote:
| 0.06% is way higher than I would expect
| phatfish wrote:
| Isn't the non-LLM generated text becoming more valuable for
| training as the web at large is flooded with slop?
|
| Preventing new human generated text from being used by AI firms
| (without consent) seems like a valid strategy.
| tossandthrow wrote:
| No.
|
| Modern LLMs are trained on a large percentage of synthetic
| data.
|
| This sentiment is largely legacy (even though just a couple
| of years old).
| Havoc wrote:
| As someone perpetually online it's also making me rethink that a
| bit
|
| Unless you love walled gardens, doomscrolling and endless AI slop
| that seems like the fun is over
| stuaxo wrote:
| The New York Times is awful I want it to be archived so people
| can see that in the future.
| Archonical wrote:
| I don't read it. Why is it awful?
| lyu07282 wrote:
| From Manufacturing Consent:
|
| > by selection of topics, by distribution of concerns, by
| emphasis and framing of issues, by filtering of information,
| by bounding of debate within certain limits. They determine,
| they select, they shape, they control, they restrict -- in
| order to serve the interests of dominant, elite groups in the
| society."
|
| > "history is what appears in The New York Times archives;
| the place where people will go to find out what happened is
| The New York Times. Therefore it's extremely important if
| history is going to be shaped in an appropriate way, that
| certain things appear, certain things not appear, certain
| questions be asked, other questions be ignored, and that
| issues be framed in a particular fashion."
|
| The propaganda in the New York Times is especially precious
| because of how highly respected it is. There never was a war
| or other elite interest they didn't push along.
| mikkupikku wrote:
| They have a very long track record of pretending to be
| independent but actually toeing the government's line at key
| pivotal moments in history when an independent newspaper is
| needed the most. Everybody here knows how they helped start
| the second Iraq war I hope, but that wasn't a one-off fluke.
| Go back through the major wars in American history and you
| can find the New York Times championing the cause of war
| before each of these. World War 2, they uncritically accepted
| Walter Duranty letting Stalin ghostwrite for him,
| specifically w.r.t. Stalin's man-made famine in Ukraine,
| because America was allied with Stalin. WWI, frequent
| editorializing of Germans being wild Asiatic savages while
| the Anglos were good and noble people that Americans owed
| something to for some reason nobody could explain. Vietnam,
| they uncritically accepted government reports on the second
| Gulf of Tonkin incident _which never happened_ and broadly
| accepted the government's own reports about how the war was
| going, at least in the early years when it still might have
| been possible to avoid further engagement. Korean war, they
| supported the government narrative of communist containment.
| First Iraq War, they uncritically reported very dubious
| atrocity propaganda, like the fraudulent "Nayirah testimony"
| given by the teenage daughter of a diplomat pretending to be
| a politically uninvolved hospital worker.
|
| The pattern here is deference to official narratives at
| precisely the times when criticism is needed the most.
| martey wrote:
| > _World War 2, they uncritically accepted Walter Duranty
| letting Stalin ghostwrite for him, specifically w.r.t.
| Stalin's man-made famine in Ukraine, because America was
| allied with Stalin._
|
| Duranty's New York Times articles were written in 1931, a
| decade before America entered World War II. They not only
| predate an American alliance with the Soviet Union, but
| they also predate the United States having any diplomatic
| relations with the Soviet Union whatsoever.
|
| > _Go back through the major wars in American history and
| you can find the New York Times championing the cause of
| war before each of these._
|
| Are there other major American newspapers who have a
| history of dissenting against war? Wasn't the New York
| Times' behavior in most of the conflicts you mention in
| line with American popular opinion?
| mikkupikku wrote:
| The American political apparatus was already normalizing
| relations with the Soviet Union due to the Japanese
| invasion of Manchuria (1931, which is when WW2 truly
| started), due to the Great Depression in America making
| alliance with the Soviets look economically advantageous
| for America, and due to political instability in Germany
| and Italy. There was a strong sense of shit hitting the
| fan soon and that America would be with the Soviet Union
| through it. FDR officially recognized the Soviet Union in
| 1933, during the peak of Stalin's famine in Ukraine,
| which the New York Times was actively denying.
|
| As for other newspapers, the Times isn't worse but bears
| the brunt of the criticism because they are after all
| America's foremost, most influential newspaper.
| martey wrote:
| Your comment is full of historical revisionism. The
| Second World War has little or nothing to do with the
| Holodomor. The Times' lack of reporting on it has nothing
| to do with American foreign policy (both Duranty and
| Gareth Jones were British) and everything to do with
| credulous reporters. The idea that America and the Soviet
| Union would be natural allies was not the majority
| viewpoint in the 1930s (outside of American communist
| propaganda) and is clearly disproved by the Molotov-
| Ribbentrop Pact.
| lyu07282 wrote:
| > Wasn't the New York Times' behavior in most of the
| conflicts you mention in line with American popular
| opinion?
|
| Dear god, what? I love the unintentional satire, it's so
| funny. "It's fine if the media lies to the people if the
| people believe the lies." That's low even for this
| stemlord dumpsterfire of a platform
| martey wrote:
| > _"It's fine if the media lies to the people if the
| people believe the lies."_
|
| That is low, but that's neither a direct quote nor an
| accurate paraphrase of my comment. While I realize that
| the comment I replied to was edited after my response to
| talk about lying in more recent conflicts (which might be
| causing your confusion), I don't think you (like OP) are
| trying to make the argument that the New York Times is
| bad because of their reporting in the 1930s.
| martey wrote:
| It's bad etiquette to edit your comment after people have
| replied to it without showing what your edits were. Please
| do not do this.
| gsky wrote:
| All media opinion articles are nothing but propaganda pieces.
| Every media outlet only allows those aligned with their ideology
| to write those pieces
| VladVladikoff wrote:
| As a site operator who has been battling with the influx of
| extremely aggressive AI crawlers, I'm now wondering if my tactics
| have accidentally blocked internet archive. I am totally ok with
| them scraping my site, they would likely obey robots.txt, but
| these days even Facebook ignores it, and exceeds my stipulated
| crawl delay by distributing their traffic across many IPs. (I
| even have a special nginx rule just for Facebook.)
|
| Blocking certain JA3 hashes has so far been the most effective
| countermeasure. However, I wish there was an nginx wrapper
| around huginn-net that could help me do TCP fingerprinting as
| well, as I do not know Rust and feel terrified of asking an LLM
| to make it. There is also a race condition issue with that
| approach: as it is passive fingerprinting, even the JA4 hashes
| won't be available for the first connection, and the AI crawlers
| I've seen do one request per IP, so you don't get a chance to
| block the second request (it never happens).
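
For readers unfamiliar with the JA3 hashes mentioned above, here is a minimal sketch of how one is derived and checked against a blocklist. JA3 joins five decimal fields from the TLS ClientHello (version, cipher suites, extensions, elliptic curves, point formats) with commas, joins the values inside each field with dashes, and takes the MD5 of the result. The sketch assumes the ClientHello has already been parsed and GREASE values already removed; the function names and the sample values are hypothetical, and this is not the commenter's actual nginx setup.

```python
import hashlib
from typing import Iterable, Set


def ja3_string(version: int, ciphers: Iterable[int], extensions: Iterable[int],
               curves: Iterable[int], point_formats: Iterable[int]) -> str:
    """Build the JA3 string: five comma-separated fields, dash-joined values."""
    def join(values: Iterable[int]) -> str:
        return "-".join(str(v) for v in values)

    return ",".join([str(version), join(ciphers), join(extensions),
                     join(curves), join(point_formats)])


def ja3_hash(version: int, ciphers: Iterable[int], extensions: Iterable[int],
             curves: Iterable[int], point_formats: Iterable[int]) -> str:
    """MD5 hex digest of the JA3 string (32 lowercase hex characters)."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


def should_block(fingerprint: str, blocklist: Set[str]) -> bool:
    """Exact-match lookup; a client that changes any field gets a new hash."""
    return fingerprint in blocklist


if __name__ == "__main__":
    # Hypothetical ClientHello values, for illustration only.
    fp = ja3_hash(771, [4865, 4866, 4867], [0, 11, 10, 35], [29, 23, 24], [0])
    print(fp, should_block(fp, {fp}))
```

Because the lookup is an exact match on a hash of client-chosen fields, a crawler that reorders or randomizes its cipher or extension lists produces a different fingerprint, which is the evasion raised in the reply below.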
| mycall wrote:
| Evasion techniques like JA3 randomization or impersonation can
| bypass detection.
| VladVladikoff wrote:
| I am aware, fortunately I haven't seen much of this... yet.
| Also JA4 is supposed to be a bit less vulnerable to this.
| Also this is why I really want TCP and HTTP fingerprinting.
| But the best I've found so far is
| https://github.com/biandratti/huginn-net and is only
| available as a Rust library, I really need it as an nginx
| module. I've been tempted to try to vibe code an nginx module
| that wraps this library.
| andrepd wrote:
| I wonder if it would be practical to have bot-blocking measures
| that can be bypassed with a signature from a set of whitelisted
| keys... In this case the server would be happy to allow
| Internet Archive crawlers.
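
A rough sketch of the allowlist idea above, under loud assumptions: it swaps the public-key signature the comment describes for a shared-secret HMAC so that it runs on the Python standard library alone, and the crawler ID, secret, function names, and five-minute clock-skew window are all hypothetical. A real deployment would more likely use asymmetric signatures or client certificates, so that the site never holds a secret capable of impersonating the crawler.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical allowlist: crawler ID -> shared secret (placeholder value).
ALLOWED_KEYS = {"example-archive": b"not-a-real-secret"}


def sign_request(key: bytes, crawler_id: str, path: str, timestamp: int) -> str:
    """Crawler side: HMAC-SHA256 over the ID, path, and timestamp."""
    message = f"{crawler_id}\n{path}\n{timestamp}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_allowlisted(crawler_id: str, path: str, timestamp: int, signature: str,
                   now: Optional[float] = None, max_skew: int = 300) -> bool:
    """Server side: True if the signature is valid and the timestamp is fresh."""
    key = ALLOWED_KEYS.get(crawler_id)
    if key is None:
        return False
    current = time.time() if now is None else now
    if abs(current - timestamp) > max_skew:
        return False  # Stale or future-dated request; limits replay.
    expected = sign_request(key, crawler_id, path, timestamp)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    ts = int(time.time())
    sig = sign_request(ALLOWED_KEYS["example-archive"], "example-archive", "/news/1", ts)
    print(is_allowlisted("example-archive", "/news/1", ts, sig))
```

Requests that pass this check would skip the bot-blocking rules; everything else would go through them as usual.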
| freedomben wrote:
| That's an interesting idea. mTLS could probably be used for
| this pretty easily. It would require IA to support it of
| course, but could be a nice solution. I wonder, do they
| already support it? I might throw up a test...
| danrl wrote:
| > they would likely obey robots.txt
|
| If only... Despite providing a useful service, they are not as
| nice towards site owners as one would hope.
|
| Internet Archive says:
|
| > We see the future of web archiving relying less on robots.txt
| file declarations geared toward search engines
|
| https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
|
| They are not alone in that. The "Archiveteam", a different
| organization, not to be confused with archive.org, also doesn't
| respect robots.txt according to their wiki:
| https://wiki.archiveteam.org/index.php?title=Robots.txt
|
| I think it is safe to say that there is little consideration
| for site owners from the largest archiving organizations today.
| Whether there should be is a different debate.
| sunaookami wrote:
| What an absolutely insufferable explanation from ArchiveTeam.
| What else do you expect from an organization aggressively
| crawling websites and bringing them down to their knees
| because they couldn't care less?
| rossng wrote:
| I'm curious to hear about examples of where this has
| happened. Because ArchiveTeam also has an important role in
| rescuing cultural artefacts that have been taken into
| private hands and then negligently destroyed.
| tredre3 wrote:
| Having a laudable goal doesn't absolve them from bad
| behavior.
| tech234a wrote:
| That page was written by Jason Scott in 2011 and has barely
| been changed since then.
| wlonkly wrote:
| ArchiveTeam (which is not the Internet Archive)
| aggressively crawls websites because they care _a lot_,
| because the website in question is about to go away.
|
| Heck, I'd say as caring goes, ArchiveTeam cares more than
| the owners of the website, because in the ideal shutdown,
| the owners provide the data instead of forcing people to
| scrape it if they want to retain it after the site shuts
| down.
| AnthonyMouse wrote:
| It seems like the general problem is that the original common
| usage of robots.txt was to identify the parts of a site that
| would lead a recursive crawler into an infinite forest of
| dynamically generated links, which nobody wants, but it's
| increasingly being used to disallow the fixed content of the
| site which is the thing they're trying to archive and which
| shouldn't be a problem for the site when the bot is caching
| the result so it only ever downloads it once. And more sites
| doing the latter makes it hard for anyone to distinguish it
| from the former, which is bad for everyone.
|
| > The "Archiveteam", a different organization, not to be
| confused with archive.org, also doesn't respect robots.txt
| according to their wiki
|
| "Archiveteam" exists in a different context. Their usual
| purpose is to get a copy of something _quickly_ because it's
| expected to go offline soon. This both a) makes it irrelevant
| for ordinary sites in ordinary times and b) gives the ones
| about to shut down an obvious thing to do, i.e. just give
| them a better/more efficient way to make a full archive of
| the site you're about to shut down.
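| The "original usage" of robots.txt described in this comment
| (fencing off an endless space of generated links while leaving
| fixed content crawlable) can be illustrated with Python's
| standard robots.txt parser. The rules, the bot name, and the
| example.org URLs are made up for illustration. Crawl-delay is a
| non-standard directive that this parser happens to read.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only the infinite, dynamically generated
# calendar is off limits; ordinary articles stay crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent, url):
    """What a well-behaved crawler would check before each request."""
    return parser.can_fetch(agent, url)
```

| Here may_fetch("ExampleArchiveBot", ...) allows an article URL
| and refuses anything under /calendar/, and
| parser.crawl_delay("ExampleArchiveBot") gives the delay the
| site asked for. None of this binds a crawler that ignores it.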
| b1n wrote:
| Archive now, make public after X amount of time. So, maybe both
| publisher and archiver are happy (or less sad).
| catapart wrote:
| I'm seeing a lot of comments about how we maintain the status
| quo, but I'm very interested in hearing from anyone who has
| conceded that there is no way to stop AI scrapers at this point
| and what that means for how we maintain public information on the
| internet in the future.
|
| I don't necessarily believe that we won't find some half-
| successful solution that will allow server hosting to be done as
| it currently is, but I'm not very sure that I'll want to
| participate in whatever schemes come about from it, so I'm
| thinking more about how I can avoid those schemes rather than
| insisting that they won't exist/work.
|
| The prevailing thought is that if it's not possible now, it won't
| be long before a human browser will be indistinguishable from an
| LLM agent. They can start a GUI session, open a browser, navigate
| to your page, snapshot from the OS level and backwork your
| content from the snapshot, or use the browser dev tools or
| whatever to scrape your page that way. And yes, that would be
| much slower and more inefficient than what they currently do, but
| they would only need to do that for those that keep on the
| bleeding edge of security from AI. For everyone else, you're in a
| security race against highly-paid interests. So the idea of
| having something on the public internet that you can stop people
| from archiving (for whatever purpose they want) seems like it's
| soon to be an old-fashioned one.
|
| So, taking it as a given that you can't stop what these people
| are currently trying to stop (without a legislative solution and
| an enforcement mechanism): how can we make scraping less of a
| burden on individual hosts? Is this thing going to coalesce into
| centralizing "archiving" authorities that people trust to archive
| things, and serve as a much more structured and friendly way for
| LLMs to scrape? Or is it more likely someone will come up with a
| way to punish LLMs or their hosts for "bad" behavior? Or am I
| completely off base? Is anyone actually discussing this? And, if
| so, what's on the table?
| titzer wrote:
| You're going to hate this, but one answer might be blockchain.
| A cryptographically strong, attestable public record of
| appending information to a shared repository. Combined with
| cryptographic signatures for humans, it's basically a secure,
| open git repository for human knowledge.
| catapart wrote:
| Sounds interesting, but I guess I'm a little unsure of how to
| connect the dots? Are you suggesting that websites would be
| hosted on a blockchain and browsed by human-signed browsers?
| Or more like there would be a blockchain authority, which
| server hosts could query to determine if a signature,
| provided by their browser, is human? Would you mind painting
| the picture in a little more detail?
| echelon wrote:
| We're rarely going to need to attest anything is "real" or
| "human". It's basically only going to matter in civil and
| criminal court, and IDV.
|
| We don't need to attest signals are analogue vs. digital. The
| world is going to adapt to the use of Gen AI in everything.
| The future of art, communications, and productivity will all
| be rooted in these tools.
| sharperguy wrote:
| You can have cryptographically signed data caches without the
| need for a blockchain. What a blockchain can add is the
| ability to say that a particular piece of data must have
| existed before a given date, by including the hash of that
| data somewhere in the chain.
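| A toy sketch of that timestamping property, assuming nothing
| about any real blockchain: each block commits to the previous
| block's ID plus a set of content hashes, so finding a hash in an
| old block shows the data existed when that block was made. The
| class and method names are invented for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class TimestampChain:
    """Toy append-only chain; each block commits to the previous block."""

    def __init__(self):
        self.blocks = []  # dicts with keys: prev, time, hashes, id

    def append(self, time, data_hashes):
        """Add a block recording content hashes at the given time label."""
        prev = self.blocks[-1]["id"] if self.blocks else "0" * 64
        hashes = sorted(data_hashes)
        body = "|".join([prev, time, *hashes]).encode("utf-8")
        self.blocks.append(
            {"prev": prev, "time": time, "hashes": hashes, "id": sha256_hex(body)}
        )

    def is_intact(self):
        """Recompute every block ID and link; False if anything was altered."""
        prev = "0" * 64
        for block in self.blocks:
            body = "|".join([prev, block["time"], *block["hashes"]]).encode("utf-8")
            if block["prev"] != prev or block["id"] != sha256_hex(body):
                return False
            prev = block["id"]
        return True

    def existed_by(self, data: bytes):
        """Return the time label of the first block containing this data's hash."""
        digest = sha256_hex(data)
        for block in self.blocks:
            if digest in block["hashes"]:
                return block["time"]
        return None
```

| What this toy leaves out is exactly what a public chain adds:
| many independent parties holding copies, so that nobody can
| quietly rewrite the whole list with backdated times.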
| techjamie wrote:
| > Combined with cryptographic signatures for humans
|
| What happens when the human gives an agent access to said
| signature? Then you fall back on traditional anti-bot
| techniques and you're right back where you started.
| jakeydus wrote:
| DNA/biometrics are the only secure future!
|
| I joke, but there are those out there who don't.
| amarant wrote:
| You'd spend less compute just serving the crawlers than
| maintaining the blockchain.
|
| Like, 3 orders of magnitude less compute, conservatively
| counting.
| heavyset_go wrote:
| If you don't publish content to the public web anymore, you
| don't have to worry about traffic or scraping or bots.
|
| Maybe it'll just be cheaper for CDNs or whatever to sell the
| data they serve directly instead of doing extra steps with
| scraping
| miki123211 wrote:
| The only answer is WebDRM.
|
| It's easy to pretend you're human, it's hard to pretend that
| you have a valid cryptographic signature for Google which
| attests that your hardware is Google-approved.
|
| Crawling is the price we pay for the web's openness.
| realusername wrote:
| It's not hard to bypass attestation, it's actually very
| easy and done right now at scale, there's giant click farms
| with phones on racks.
|
| They don't modify any device and will pass whatever
| attestation you try to make.
| eikenberry wrote:
| I think this is what will happen: the public internet will
| become the place you go to seed the data you want to the
| scrapers, and you will use a private internet for everything
| else. Private sites, private feeds, mesh networks, etc. We're
| basically going back in time, similar to when AOL and friends
| had their own private networks for their members.
| suzzer99 wrote:
| I don't see this as a permanent problem. Right now there must
| be 1000s of well-funded AI companies trying to scrape the
| entire internet. Eventually the AI equity bubble will pop and
| there will be consolidation. If every player left has already
| scanned the web, will they need to keep constantly scanning it?
| Seems like no. Even if they do, there will be a lot less of
| them.
| kdheiwns wrote:
| The current trend is that it's getting cheaper and easier to
| roll out your own AI on your own computer, so more and more
| people will do it as a hobby. Even if the big players die
| out, some dude with a decent gaming PC could decide to start
| scraping everything pertaining to their interests just for
| the hell of it. Every government with a budget and someone
| capable of doing the job will surely get in on it as well.
| overfeed wrote:
| > some dude with a decent gaming PC could decide to start
| scraping everything pertaining to their interests just for
| the hell of it.
|
| Not from their single residential IP, they are not.
|
| If they do succeed[1] - it is not going to be at hundreds
| or thousands of requests per second that the current AI
| scrapers bombard servers with. Some dude at home will, at
| best, be putting 4-6 orders of magnitude less strain on a
| limited set of servers.
|
| 1. Scraping is an arms race: if you're just "some dude" at
| the skill floor - you're going to have a bad time whether
| you're scraping, or defending against scrapers.
| ronsor wrote:
| > without a legislative solution and an enforcement mechanism
|
| If there's one thing people, especially HN users, should've
| learned by now, it's that there's no enforcement mechanism
| worth a damn for Internet legislation when incentives don't
| align.
| AnthonyMouse wrote:
| > how can we make scraping less of a burden on individual
| hosts?
|
| Isn't this basically what content-addressable storage is for?
| Have the site provide the content hashes rather than the
| content and then put the content on IPFS/BitTorrent/whatever
| where the bots can get it from each other instead of bothering
| the site.
|
| Extra points if you can get popular browsers to implement
| support for this, since it also makes it a lot harder to censor
| things and a decent implementation (i.e. one that prefers
| closer sources/caches) would give most of the internet the
| efficiency benefits of a CDN without the centralization.
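| A minimal sketch of the content-addressing idea, assuming
| SHA-256 as the address and plain dictionaries standing in for
| peers. It is not the IPFS or BitTorrent API. The point is that
| the site only publishes hashes, and a fetcher can take the
| bytes from any untrusted peer because it re-checks the hash.

```python
import hashlib


def address_of(content: bytes) -> str:
    """The content's address is simply its SHA-256 hex digest."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(pages):
    """Site side: map each path to a content address instead of serving bytes."""
    return {path: address_of(body) for path, body in pages.items()}


def fetch_verified(digest, peers):
    """Bot side: try each peer (a dict of digest -> bytes) until one verifies."""
    for peer in peers:
        content = peer.get(digest)
        if content is not None and address_of(content) == digest:
            return content
    return None
```

| A peer that returns altered bytes is simply skipped, which is
| why the fetcher does not need to trust whoever holds the copy.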
| zer00eyz wrote:
| > anyone who has conceded that there is no way to stop AI
| scrapers at this point and what that means for how we maintain
| public information on the internet in the future.
|
| Bloat and bandwidth costs are the real problems here. Everyone
| seems to have forgotten the basics of engineering and
| accounting.
| rkwtr1299 wrote:
| The EFF has a lukewarm stance on AI, but criticizes everyone
| else. AI is clearly ruining the Internet and the job market.
|
| How about thinking about your mission and taking an anti-AI
| hardliner stance? But I see multiple corporate sponsors that
| would not be pleased:
|
| https://www.eff.org/thanks
|
| All these so called freedom organizations like the OSI and the
| EFF have been bought and are entirely irrelevant if not harmful.
| rdiddly wrote:
| When you disappear from the historical record, that's called you
| becoming irrelevant. The world moves on, and pays attention to
| someone else. Not sure why the Times doesn't seem to see this
| angle.
| lich_king wrote:
| I am really tired of this kind of moralizing. The reality is that
| every time geeks come up with some utopian ideal, such as that we
| should publish all our software under free licenses or make all
| human knowledge freely accessible to anyone, _the same geeks
| later show up and build extractive industries on top of this_. Be
| a part of the open source revolution... so that you do unpaid
| labor for Facebook. Make a quirky homepage... so that we can
| bootstrap global-scale face recognition tech. Help us build the
| modern-day library of Alexandria... so that OpenAI and Anthropic
| can sell it back to you in a convenient squeezable tube.
|
| Maybe it's time to admit that the techie community has a pretty
| bad moral compass and that we're not good stewards of the world's
| knowledge. We turn lofty ideals into amoral money-making schemes
| whenever we can. I'm not sure that the EFF's role in this is all
| that positive. They come from a good place, but they ultimately
| aid a morally bankrupt industry. I don't want archive.org to
| retain a copy of everyone's online footprint because I know it
| will be used the same way it always is: to make money off other
| people's labor and to erode privacy.
| Peritract wrote:
| Agreed; again and again, we see that the utopian ideals of the
| tech world are only the ones that let them extract value
| without consideration.
| alexpotato wrote:
| As someone who did a lot of work on early spam fighting only to
| see it replaced by things like DKIM, I wonder if we are going to
| start having the "taxi medallion" style approach but for people
| connecting to your site.
|
| e.g. IA will send out HTTPS requests signed with their key so
| you, as the site owner, can confirm that they are indeed from
| them and not from AI.
|
| Feels like that would be very anti open internet but not sure how
| else you would prove who is a good actor vs not (from your
| perspective that is).
| m3047 wrote:
| I'll tell you what I expect to see from crawlers and agents,
| and which I'm enforcing on everybody who doesn't look
| distinctly human:
|
| * Reverse DNS which points to a web site which has a
| discoverable / well-known page which clearly describes their
| behavior.
|
| * Some sort of reverse IP based, RBL- and SPF-inspired TXT
| records which describe who, what, when, why, how, how often
|
| so that I can make automated decisions based on it.
|
| Yah, I don't have a lot of crawlers that I welcome... but I'm
| building a pretty good database of the worst offenders. At
| scale... there are advantages to scale which work in my favor,
| actually.
|
| I documented this at the end of a blog post when I made
| blocking Amazon incoming requests a default policy several
| years ago.
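| To make "automated decisions based on it" concrete, here is a
| sketch under a loudly stated assumption: no such TXT record
| format exists as a standard, so the "v=crawler1" layout, the
| field names, and the thresholds below are all invented. The DNS
| lookup itself is left out; the input is the TXT string a
| reverse-IP lookup would have returned.

```python
def parse_crawler_txt(txt):
    """Parse a hypothetical 'key=value; key=value' crawler policy record."""
    fields = {}
    for part in txt.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def decide(txt, allowed_purposes=frozenset({"archive"}), max_rps=1.0):
    """Return 'allow', 'throttle', or 'block' for a crawler's declared policy."""
    fields = parse_crawler_txt(txt or "")
    # No record, or an unknown version: treat as an undeclared crawler.
    if fields.get("v") != "crawler1":
        return "block"
    if fields.get("why") not in allowed_purposes:
        return "block"
    try:
        declared = float(fields.get("max-rps", ""))
    except ValueError:
        return "block"
    return "allow" if declared <= max_rps else "throttle"
```

| A declaration like this only says what the crawler claims, so
| it would still have to be checked against observed behavior.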
| ashwinnair99 wrote:
| We're essentially burning the library to punish the arsonist. The
| arsonist already left.
| tremon wrote:
| What do you mean, "the arsonist already left"? Isn't it more
| accurate to say that 90% of the library's visitors are
| arsonists?
| pamcake wrote:
| It is not accurate. A very small number of actors pose as
| many and make up the majority of traffic. For example, your
| User-Agent block may cut traffic by 10%, 99% of which is
| malicious - but you blocked 1000 individuals, only 1 of whom
| is malicious.
| phendrenad2 wrote:
| Does IA use a known set of IPs? Should be trivial to let them
| through. But yeah, news companies aren't technically capable of
| this kind of finesse, they probably have by-the-hour contractors
| doing any coding/config changes, and closing the ticket is the
| goal there.
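| If a crawler did publish a fixed set of addresses, letting it
| through could look roughly like the sketch below. The ranges are
| reserved documentation networks used as placeholders and are not
| the Internet Archive's actual addresses, which the thread does
| not state.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), NOT real crawler IPs.
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "2001:db8::/32")
]


def is_allowlisted(addr):
    """True if the client address falls inside any allowlisted network."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Malformed addresses are never allowlisted.
        return False
    return any(ip in net for net in ALLOWED_NETWORKS)
```

| The check would run before any bot-blocking rules, so that
| allowlisted clients skip them and everyone else is unaffected.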
| neilv wrote:
| I'm now an AI bro, and a long-time fan of the EFF (though they
| occasionally make a mistake).
|
| I think this EFF piece could be more forthright (rather than
| political persuasion), since the matter involves balancing
| multiple public interest goals that are currently in opposition.
|
| > _Organizations like the Internet Archive are not building
| commercial AI systems._
|
| This NiemanLab article lists evidence that Internet Archive
| explicitly encouraged crawling of their data, which was used for
| training major commercial AI models:
|
| | News publishers limit Internet Archive access due to AI
| scraping concerns (niemanlab.org) | 569 points by ninjagoo 34
| days ago | 366 comments |
| https://news.ycombinator.com/item?id=47017138
|
| > _[...] over a fight that libraries like the Archive didn't
| start, and didn't ask for._
|
| They started or stumbled into this fight through their actions.
| And (ideology?) they also started and asked for a related fight,
| about disregard of copyright and exploitation of creators:
|
| | Internet Archive forced to remove 500k books after publishers'
| court win (arstechnica.com) | 530 points by cratermoon on June
| 21, 2024 | 564 comments |
| https://news.ycombinator.com/item?id=40754229
| charcircuit wrote:
| The EFF is being obtuse. Using archive sites is a known bypass
| for reading news articles for free. Every time a paywalled site
| is posted, someone posts an archive link so others can read it
| for free.
|
| >Archiving and Search Are Legal
|
| But giving full articles away for free to everyone is not.
| Archive.org has the power to make archives private.
| m3047 wrote:
| If you're selling ammonium nitrate and diesel, it's a reasonable
| presumption that you're in the agricultural supply business. It's
| also reasonable to expect you not to sell a truckload of both to
| someone who you don't know to be a farmer.
___________________________________________________________________
(page generated 2026-03-21 23:00 UTC)